CN105956158A - Automatic extraction method of network neologism on the basis of mass microblog texts and use information - Google Patents

Automatic extraction method of network neologism on the basis of mass microblog texts and use information

Info

Publication number
CN105956158A
Authority
CN
China
Prior art keywords
neologisms
word
text
list
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610324541.9A
Other languages
Chinese (zh)
Other versions
CN105956158B (en)
Inventor
黄永峰
吴方照
刘佳伟
袁志刚
吴思行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201610324541.9A
Publication of CN105956158A
Application granted
Publication of CN105956158B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an automatic extraction method of a network neologism on the basis of mass microblog texts and use information. The automatic extraction method comprises the following steps: obtaining a microblog text and an author identifier corresponding to the microblog; establishing a neologism list; according to a Chinese analysis tool, carrying out a word segmentation operation on the microblog text, obtaining a segmentation word segment, and independently carrying out statistics on the word frequency information of each segmentation word segment on the basis of two dimensions including the text and a user; deleting a word of which the word frequency is smaller than a first frequency threshold value in the neologism list from the neologism list; carrying out statistics on all two-tuples and triples in microblog data, and taking the two-tuples and triples as candidate neologisms; calculating the score of the relevance of the candidate neologisms; adding the word of which the word frequency is greater than a second frequency threshold value and the score of the relevance is greater than a score threshold value in the candidate neologisms into the neologism list; and carrying out iterative execution on the above process until no new neologisms are generated and no candidate neologisms in the neologism list are deleted. The automatic extraction method can automatically extract the network neologism and has high accuracy and low time and space complexity.

Description

Method for automatically extracting network neologisms based on massive microblog text and user information
Technical field
The present invention relates to the technical field of network data mining, and in particular to a method for automatically extracting network neologisms based on massive microblog text and user information.
Background art
New-word discovery is an important part of Chinese natural language processing research. A neologism is a word that does not appear in traditional dictionaries. On the Internet, and particularly on social networks, neologisms emerge constantly. Social network users frequently use network neologisms to express strong feelings, to add personal emotional color, or to make their posts more interesting and lively. A neologism may be an abbreviation of a longer word or sentence, a homophone of a traditional word, or a word entirely unrelated to any traditional word. Social networks are now an important part of the Internet, and the analysis of social media data is a popular research area for experts and scholars in data mining. On the one hand, social media data is updated very quickly, so the amount of data available for study is very large. On the other hand, because social network users are highly active and tend to use novel expressions that depart from the grammatical rules of traditional text, neologisms emerge in large numbers on social networks, which poses a great challenge to traditional text analysis techniques.
Unlike languages such as English, in which spaces naturally separate words, Chinese text consists of a sequence of Chinese characters, and the unit that carries meaning is usually the word rather than the single character. Each Chinese word has its own specific meaning and part of speech. The first step of most Chinese natural language processing tasks is therefore to divide the text into "word segments" made up of different words, a step called "word segmentation". Word segmentation depends largely on the dictionary the segmenter uses. According to statistics, more than 60% of segmentation errors are caused by failure to divide neologisms correctly: because neologisms are absent from the segmentation tool's dictionary, the tool cannot recognize them correctly.
Traditional neologism detection methods mainly take the following forms: embedding neologism detection in the word segmentation task; relying on complex linguistic rules and knowledge; converting word detection into a classification problem; and statistical methods. These methods generally struggle to reach high accuracy, and their time or space complexity is often high because the algorithms generate a large number of candidate neologisms.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, an object of the invention is to propose a method for automatically extracting network neologisms based on massive microblog text and user information. The method builds on microblog text, also takes user information into account, and establishes an iterative algorithm that extracts network neologisms automatically; its results have higher accuracy and lower time and space complexity.
To achieve this object, an embodiment of the invention discloses a method for automatically extracting network neologisms based on massive microblog text and user information, comprising the following steps:
S1: obtain microblog data, wherein the microblog data includes microblog text and the author identifier corresponding to each microblog;
S2: establish a neologism list, wherein the neologism list is initialized to the empty set;
S3: add the neologism list to a preset Chinese analysis tool, and segment the microblog text with the Chinese analysis tool so as to map the microblog text to a set of segments, then count the frequency information of each segment in the two dimensions of text and user;
S4: update the frequency information of the corresponding words in the neologism list according to the frequency information obtained, and delete from the neologism list every word whose frequency is below a first frequency threshold;
S5: define n consecutive segments produced by the segmentation operation as an n-tuple, count all 2-tuples and 3-tuples in the microblog data, and take the 2-tuples and 3-tuples as candidate neologisms;
S6: according to the distribution of the candidate neologisms in the two dimensions of text and user, count the frequency information of the candidate neologisms in those two dimensions, and calculate the association score of each candidate neologism;
S7: add to the neologism list every candidate neologism whose frequency exceeds a second frequency threshold and whose association score exceeds a score threshold; and
S8: iterate S2 to S7 until no new candidate neologisms are produced from the microblog data and no candidate neologisms are deleted from the neologism list.
The method according to the embodiments of the invention exploits the characteristics of microblog data: building on microblog text, it also takes user information into account, and it establishes an iterative algorithm to extract network neologisms automatically. Compared with traditional methods, its results have higher accuracy and lower time and space complexity, which gives the method important applications in mining and analyzing social media data.
In addition, the method for automatically extracting network neologisms based on massive microblog text and user information according to the above embodiments of the invention may also have the following additional technical features:
In some examples, in S4, neologisms are extracted automatically on the basis of the microblog text in combination with the user information of the microblog data.
In some examples, in S8, the segments in the segmentation result are merged iteratively by the iterative algorithm, and each iteration only needs to look for 2-tuples and 3-tuples in the microblog data.
In some examples, S8 further includes: after each iteration, adding the neologisms found to the neologism list and using the list as the user-defined dictionary of the preset Chinese analysis tool, so that the next segmentation operation correctly divides the neologisms found in the previous iteration.
In some examples, S6 further includes: based on enhanced mutual information (EMI) theory, calculating for each word an EMI score based on text frequency, specifically:
$$\mathrm{EMI}(w^n) = \log \frac{N_{w^n}/T}{\prod_{i=1}^{n}\left[\left(N_{w_i^n} - N_{w^n}\right)/T\right]},$$
where $N_{w^n}$ and $N_{w_i^n}$ denote the frequencies, based on microblog text, of the word $w^n$ and of its $i$-th segment $w_i^n$ respectively, $T$ is the total number of microblogs, and $n$ is the order of the n-tuple, $n = 2$ or $3$;
The EMI score across users is calculated from the distribution of the word among users, specifically:
$$\mathrm{usrEMI}(w^n) = \log \frac{N^u_{w^n}/T^u}{\prod_{i=1}^{n}\left[\left(N^u_{w_i^n} - N^u_{w^n}\right)/T^u\right]},$$
where $N^u_{w^n}$ and $N^u_{w_i^n}$ denote the frequencies, based on user usage, of the word $w^n$ and of its $i$-th segment $w_i^n$ respectively, $T^u$ is the total number of users, and $n$ is the order of the n-tuple, $n = 2$ or $3$;
The association score of a candidate neologism is obtained from the EMI score based on text frequency and the EMI score across users, specifically:
$$\mathrm{ascore}(w^n) = \mathrm{EMI}(w^n) + \mathrm{usrEMI}(w^n),$$
where $\mathrm{ascore}(w^n)$ is the association score of the candidate neologism $w^n$.
In some examples, in S1, the microblog data is obtained by web crawler technology.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken together with the accompanying drawings, in which:
Fig. 1 is a flowchart of the method for automatically extracting network neologisms based on massive microblog text and user information according to an embodiment of the invention; and
Fig. 2 is an overall flowchart of the method for automatically extracting network neologisms based on massive microblog text and user information according to one embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and should not be construed as limiting it.
In the description of the invention, it should be understood that orientation or position terms such as "center", "longitudinal", "lateral", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positions shown in the drawings. They are used only to ease and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the invention. In addition, the terms "first" and "second" are used for description only and should not be understood as indicating or implying relative importance.
In the description of the invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" are to be interpreted broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or internal between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the invention according to the specific circumstances.
The method for automatically extracting network neologisms based on massive microblog text and user information according to embodiments of the invention is described below with reference to the drawings.
Fig. 1 is a flowchart of the method according to an embodiment of the invention. Fig. 2 is an overall flowchart of the method according to one embodiment of the invention. As shown in Figs. 1 and 2, the method for automatically extracting network neologisms based on massive microblog text and user information according to embodiments of the invention comprises the following steps:
Step S1: obtain microblog data, which includes the text content of each microblog and the identifier of its author. In some examples, a large amount of microblog data is obtained by web crawler technology. For example, the microblog data set obtained is denoted D, and each item in it includes a microblog text $D_i$ and the corresponding user identifier $S_i$.
Step S2: establish a neologism list, denoted for example W, which is initialized to the empty set.
Step S3: add the neologism list to the preset Chinese analysis tool as a user-defined dictionary, and use the tool to segment each microblog text $D_i$ in the data set D, mapping $D_i$ to a set of segments; then count the frequency information of each segment in the two dimensions of text and user. In other words, each segment obtained is treated as a basic unit, denoted $w_i$, and for each basic unit $w_i$ the frequencies based on microblog text and on user information are counted and denoted $N_{w_i}$ and $N^u_{w_i}$ respectively.
Step S4: update the frequency information of the corresponding words in the neologism list according to the frequency information obtained in step S3, and delete from the neologism list W every word whose frequency is below the first frequency threshold.
In step S4, neologisms are extracted on the basis of the microblog text while making full use of the user information in the microblog data. Unlike traditional methods, which consider only the distribution of neologisms over text content, this method exploits the characteristics of microblog data and considers the distribution of network neologisms in both the text and user dimensions.
Step S5: among the segments produced by the segmentation operation, treat each individual segment as a basic unit and define n consecutive segments as an n-tuple; count all 2-tuples and 3-tuples that occur in the microblog data, and take them as candidate neologisms.
Step S6: according to the distribution of the candidate neologisms in the two dimensions of text and user, count the frequency information of the candidate neologisms in those two dimensions, and calculate the association score of each candidate neologism.
In some examples, this step specifically includes:
First, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency is calculated, specifically:
$$\mathrm{EMI}(w^n) = \log \frac{N_{w^n}/T}{\prod_{i=1}^{n}\left[\left(N_{w_i^n} - N_{w^n}\right)/T\right]},$$
where $N_{w^n}$ and $N_{w_i^n}$ denote the frequencies, based on microblog text, of the word $w^n$ and of its $i$-th segment $w_i^n$ respectively, $T$ is the total number of microblogs, and $n$ is the order of the n-tuple, $n = 2$ or $3$;
Second, the EMI score across users is calculated from the distribution of the word among users, specifically:
$$\mathrm{usrEMI}(w^n) = \log \frac{N^u_{w^n}/T^u}{\prod_{i=1}^{n}\left[\left(N^u_{w_i^n} - N^u_{w^n}\right)/T^u\right]},$$
where $N^u_{w^n}$ and $N^u_{w_i^n}$ denote the frequencies, based on user usage, of the word $w^n$ and of its $i$-th segment $w_i^n$ respectively, $T^u$ is the total number of users, and $n$ is the order of the n-tuple, $n = 2$ or $3$;
Finally, the association score of the candidate neologism is obtained from the EMI score based on text frequency and the EMI score across users, specifically:
$$\mathrm{ascore}(w^n) = \mathrm{EMI}(w^n) + \mathrm{usrEMI}(w^n),$$
where $\mathrm{ascore}(w^n)$ is the association score of the candidate neologism $w^n$.
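As a worked illustration of the text-based score, consider the following made-up counts. The numbers are hypothetical, and the natural logarithm is assumed because the source does not state the base. Suppose there are 1000 microblogs in total, a candidate 2-tuple appears in 50 of them, and its two segments appear in 60 and 80 microblogs respectively. Then

$$\mathrm{EMI} = \log\frac{50/1000}{\frac{60-50}{1000}\cdot\frac{80-50}{1000}} = \log\frac{0.05}{0.01 \times 0.03} \approx \log 166.7 \approx 5.12 .$$

By contrast, two common segments that rarely occur together, say a 2-tuple seen in 5 microblogs whose segments appear in 500 and 400 microblogs, give

$$\mathrm{EMI} = \log\frac{5/1000}{\frac{500-5}{1000}\cdot\frac{400-5}{1000}} = \log\frac{0.005}{0.495 \times 0.395} \approx \log 0.0256 \approx -3.67 .$$

Segments that appear mostly together therefore score high, while segments that mostly appear apart score low. The user-based score follows the same arithmetic with user counts and the total number of users.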
Step S7: add to the neologism list every candidate neologism whose frequency exceeds the second frequency threshold and whose association score exceeds the score threshold.
Step S8: iterate step S2 to step S7 until no new candidate neologisms are produced from the microblog data and no candidate neologisms are deleted from the neologism list. That is, this step establishes an iterative algorithm that merges the segments in the segmentation result iteratively. Each iteration therefore only needs to find low-order n-tuples (such as 2-tuples and 3-tuples), whereas traditional methods must find high-order n-tuples in order to discover long neologisms, so that the number of candidate words grows exponentially as n increases. Compared with traditional methods, the method of the embodiments thus replaces direct computation with iterative computation, which significantly reduces the space and time complexity of the method.
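The following Python sketch illustrates why low-order tuples suffice: a four-character word can be assembled over two rounds, each of which merges only adjacent segments. It is an illustration under stated assumptions, not the patented implementation. The function `resegment` is a hypothetical stand-in for re-running the Chinese analysis tool with the updated user-defined dictionary, and the example word is chosen only for demonstration.

```python
def resegment(segments, new_words, max_len=3):
    """Greedily merge runs of up to max_len adjacent segments that spell a listed new word.

    Hypothetical stand-in for re-running a segmenter with new_words as its
    user-defined dictionary.
    """
    out, i = [], 0
    while i < len(segments):
        for n in range(max_len, 1, -1):  # prefer the longest merge
            joined = "".join(segments[i:i + n])
            if i + n <= len(segments) and joined in new_words:
                out.append(joined)
                i += n
                break
        else:  # no merge applied at position i: keep the segment as it is
            out.append(segments[i])
            i += 1
    return out

segments = ["不", "明", "觉", "厉"]
# Round 1: two 2-tuples have been accepted into the new-word list.
round1 = resegment(segments, {"不明", "觉厉"})
# Round 2: the 2-tuple formed by the two merged segments is accepted as well.
round2 = resegment(round1, {"不明", "觉厉", "不明觉厉"})
```

A direct approach would have to enumerate 4-tuples of characters to propose the same word, whereas here each round only ever looks at pairs or triples of the current segments.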
Further, in step S8, after each iteration the neologisms found are added to the neologism list, and the list is used as the user-defined dictionary of the preset Chinese analysis tool, so that the next segmentation operation correctly divides the neologisms found in the previous iteration. The segmentation result is thus continually improved, and the improved segmentation result in turn raises the quality of the candidate neologisms.
In summary, the method according to the embodiments of the invention exploits the characteristics of microblog data: building on microblog text, it also takes user information into account, and it establishes an iterative algorithm to extract network neologisms automatically. Compared with traditional methods, its results have higher accuracy and lower time and space complexity, which gives the method important applications in mining and analyzing social media data.
To aid understanding of the invention, the method of the above embodiments is described in further detail below with a specific embodiment, taking Sina Weibo data as an example.
In this embodiment, the method comprises, for example, the following steps:
Step 1: use the API (application programming interface) provided by Sina Weibo to crawl a large amount of microblog data posted by Sina Weibo users, including every microblog text and the identifier of the corresponding user (an attribute that uniquely identifies the user, such as the user name or user ID). The data set is denoted D, and each item in it includes the microblog text content $D_i$ and the user identifier $S_i$, i.e. $D = \{(D_i, S_i) \mid i = 1, 2, 3, \ldots\}$. This prepares the data for the subsequent neologism detection task.
Step 2: define a neologism list W and initialize it to the empty set. The method updates W continuously, and the final content of W is the output of the whole method, i.e. the extracted network neologisms.
Step 3: use the neologism list W as the user-defined dictionary of the Chinese analysis tool, and use the tool to segment each text content $D_i$ in the data set D, obtaining the set of segments of each microblog, denoted $w = \{w_i \mid i = 1, 2, 3, \ldots\}$, where each $w_i$ is a segment. Because W is continuously updated, newly discovered network neologisms continually improve the segmentation result.
Step 4: from the segments in the segmentation result obtained in step 3, count the frequency information of each segment. For each segment $w_i$, count its frequency in the two dimensions of microblog text and user. Specifically, in the text-content dimension, count how many microblogs contain $w_i$, and denote the result $N_{w_i}$; in the user-usage dimension, count how many users have used $w_i$, and denote the result $N^u_{w_i}$. This step thus maps the microblog text content and user usage information to a segment frequency table, denoted G.
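The two counts described in step 4 can be sketched in Python as follows. This is a minimal sketch; the function name, the data layout (one list of segments per microblog, with a parallel list of author identifiers), and the sample data are assumptions made for illustration.

```python
from collections import Counter

def count_frequencies(segmented_posts, authors):
    """Return (text_freq, user_freq) for every segment.

    text_freq[w]: number of posts that contain w
    user_freq[w]: number of distinct authors who used w
    """
    text_freq = Counter()
    users_by_segment = {}
    for segments, author in zip(segmented_posts, authors):
        for w in set(segments):  # a post counts once per segment, however often w repeats
            text_freq[w] += 1
            users_by_segment.setdefault(w, set()).add(author)
    user_freq = {w: len(users) for w, users in users_by_segment.items()}
    return text_freq, user_freq

# Hypothetical segmented posts and their authors.
posts = [["给", "力"], ["给", "力", "啊"], ["不", "给", "力"]]
authors = ["u1", "u1", "u2"]
text_freq, user_freq = count_frequencies(posts, authors)
```

Keeping the two tables separate makes the difference between the dimensions visible: a segment posted many times by a single author has a high text frequency but a user frequency of 1.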
Step 5: after the frequency information of each segment has been counted, use it to update the neologism list W. For each word in W, look up its entry in the segment frequency table G, and filter out of W every word whose microblog-text frequency $N_{w_i}$ is below the first frequency threshold. The words in W are re-counted and filtered after every segmentation because adding W to the segmentation operation as the user-defined dictionary lets the tool find a more suitable division of the microblog text based on each word in W, which improves the segmentation result; erroneous neologisms found earlier can then be filtered out according to the improved segmentation.
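The filtering in step 5 amounts to a threshold test against the frequency table. The sketch below assumes the table is a plain mapping from segment to text frequency; the names `prune_new_words` and `first_threshold` and the sample values are illustrative and do not come from the source.

```python
def prune_new_words(new_words, text_freq, first_threshold):
    """Split new_words into (kept, removed) by post frequency.

    A word is removed when its frequency is below first_threshold; a word the
    segmenter no longer produces at all is treated as having frequency 0.
    """
    removed = {w for w in new_words if text_freq.get(w, 0) < first_threshold}
    return new_words - removed, removed

# Hypothetical state: "力啊" was accepted earlier but no longer appears after re-segmentation.
kept, removed = prune_new_words({"给力", "力啊"}, {"给力": 4, "啊": 2}, first_threshold=2)
```

Returning the removed set as well is a design choice that lets the caller check the stopping condition, which requires that an iteration delete nothing.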
Step 6: use the segments to find all 2-tuples and 3-tuples in the massive microblog text. A 2-tuple consists of two adjacent segments $w_i$ in the microblog text content and is denoted $w^2 = w_1 w_2$; a 3-tuple consists of three adjacent segments and is denoted $w^3 = w_1 w_2 w_3$. 2-tuples and 3-tuples are denoted uniformly as $w^n$, and together they constitute the candidate neologisms.
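Candidate generation in step 6 is a sliding window over the segments of each microblog. The following Python sketch assumes each microblog is already a list of segments; the function name and sample input are hypothetical.

```python
def candidate_ngrams(segments, orders=(2, 3)):
    """Yield every run of n adjacent segments, for each n in orders, as a tuple."""
    for n in orders:
        for i in range(len(segments) - n + 1):
            yield tuple(segments[i:i + n])

# Four segments produce three 2-tuples and two 3-tuples.
candidates = list(candidate_ngrams(["不", "给", "力", "啊"]))
```

Keeping each candidate as a tuple of its segments, rather than a joined string, preserves the segment boundaries needed later to look up the frequency of each component.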
Step 7: for each candidate neologism $w^n$, count its frequency information in the same way as for the segment frequency table above, based on the text content in which it occurs and on the corresponding user usage information; the result is denoted F, where F is the set containing all possible candidate neologisms. Further, a method based on statistical learning is used to determine, from the text frequency information and user usage of each word in F, whether it is a neologism. First, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency is calculated as follows:
$$\mathrm{EMI}(w^n) = \log \frac{N_{w^n}/T}{\prod_{i=1}^{n}\left[\left(N_{w_i^n} - N_{w^n}\right)/T\right]},$$
where $N_{w^n}$ and $N_{w_i^n}$ denote the frequencies, based on microblog text, of the word $w^n$ and of its $i$-th segment $w_i^n$ respectively, $T$ is the total number of microblogs, and $n$ is the order of the n-tuple ($n = 2$ or $3$). The higher the EMI score of a word $w^n$, the more strongly associated are the segments composing it, and the more likely $w^n$ is a network neologism.
Then, the distribution of the word among users is used to calculate the EMI score across users, as follows:
$$\mathrm{usrEMI}(w^n) = \log \frac{N^u_{w^n}/T^u}{\prod_{i=1}^{n}\left[\left(N^u_{w_i^n} - N^u_{w^n}\right)/T^u\right]},$$
where $N^u_{w^n}$ and $N^u_{w_i^n}$ denote the frequencies, based on user usage, of the word $w^n$ and of its $i$-th segment $w_i^n$ respectively, $T^u$ is the total number of users, and $n$ is the order of the n-tuple ($n = 2$ or $3$). The higher the user EMI score of $w^n$, the more users are likely to have used it and the stronger its association across different users, so the more likely $w^n$ is a popular network neologism.
Finally, the association score ascore of the candidate neologism $w^n$ is defined as:
$$\mathrm{ascore}(w^n) = \mathrm{EMI}(w^n) + \mathrm{usrEMI}(w^n),$$
For a candidate neologism $w^n$, a higher association score indicates that the segments composing it are more strongly associated in both the microblog-text and user-usage dimensions. Because $w^n$ was also not correctly detected by the segmentation tool, it is likely to be a popular user-coined word found in the microblogs, i.e. a network neologism.
According to prior knowledge, a "neologism" is a newly emerged word that is generally accepted, has a definite meaning, and does not appear in traditional dictionaries; a neologism must therefore be widely used by many different users. The frequency of a word and its association score reflect these criteria well. Therefore, if the association score of $w^n$ exceeds the association score threshold and its frequency also exceeds the frequency threshold, $w^n$ is added to the neologism list W.
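The scoring and selection rule described above can be sketched in Python as follows. This is a minimal sketch rather than the patented implementation. The function names, the natural logarithm, the use of the text frequency in the threshold test, and the `eps` floor are all assumptions not stated in the source. The floor handles a segment that never occurs outside the candidate, which would otherwise make the denominator zero.

```python
import math

def emi(tuple_freq, segment_freqs, total, eps=0.5):
    """log( (N_tuple / T) / prod_i((N_segment_i - N_tuple) / T) ).

    eps is an assumed floor for (N_segment_i - N_tuple); the source does not
    say how to treat a segment that never occurs outside the candidate.
    """
    denominator = 1.0
    for freq in segment_freqs:
        denominator *= max(freq - tuple_freq, eps) / total
    return math.log((tuple_freq / total) / denominator)

def ascore(text_freq, text_segment_freqs, n_posts,
           user_freq, user_segment_freqs, n_users):
    """Association score: text-based EMI plus user-based EMI."""
    return (emi(text_freq, text_segment_freqs, n_posts)
            + emi(user_freq, user_segment_freqs, n_users))

def is_new_word(score, text_freq, score_threshold, freq_threshold):
    """Accept a candidate only if both its score and its frequency exceed the thresholds."""
    return score > score_threshold and text_freq > freq_threshold

# Hypothetical counts for one candidate 2-tuple: 1000 posts, 800 users.
score = ascore(50, [60, 80], 1000, 40, [45, 70], 800)
accepted = is_new_word(score, 50, score_threshold=8.0, freq_threshold=20)
```

Both threshold values here are placeholders; the source does not give concrete values, so in practice they would need to be tuned on the data.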
The steps above constitute one iteration of the method. They are repeated until an iteration adds no new word to the neologism list W and deletes no word from it; the iteration then ends, and each item in W is a network neologism extracted by the method of the invention.
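The stopping rule can be sketched as a small control loop in Python. The sketch leaves segmentation, pruning, and candidate scoring as caller-supplied functions, because the source does not name a specific tool for them. The names `extract_until_stable`, `segment`, `prune`, and `propose`, as well as the `max_rounds` safety cap, are assumptions made for illustration.

```python
def extract_until_stable(posts, authors, segment, prune, propose, max_rounds=20):
    """Repeat segment -> prune -> propose until a round adds and deletes nothing.

    segment(text, new_words)            -> list of segments for one post
    prune(new_words, segmented, authors) -> set of words to delete from the list
    propose(segmented, authors)          -> set of words passing both thresholds
    Returns the final new-word list and the number of rounds executed.
    """
    new_words = set()
    round_no = 0
    for round_no in range(1, max_rounds + 1):
        segmented = [segment(text, new_words) for text in posts]
        removed = prune(new_words, segmented, authors)
        added = propose(segmented, authors) - new_words
        if not added and not removed:
            break  # nothing added, nothing deleted: the list is stable
        new_words = (new_words - removed) | added
    return new_words, round_no

# Toy stand-ins, used only to exercise the loop.
words, rounds = extract_until_stable(
    posts=["真给力", "好给力"],
    authors=["u1", "u2"],
    segment=lambda text, dictionary: list(text),
    prune=lambda new_words, segmented, authors: set(),
    propose=lambda segmented, authors: {"给力"},
)
```

With these toy functions the first round adds one word and the second round changes nothing, so the loop stops after two rounds. The cap on rounds is a precaution against a word being added and deleted alternately, a case the source does not discuss.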
To sum up, the method has a characteristic that the word distribution letter that make use of in microblog users dimension in the present embodiment Breath.Relative to traditional method, the method utilizes statistical method, theoretical based on strengthening mutual information (EMI), not only exists neologisms Distribution in content of text dimension is analyzed, and utilizes the characteristic of this network information carriers of microblogging simultaneously, analyzes neologisms and exists Use distribution situation between different user, this point can relatively significantly promote the accuracy rate of the neologisms that the method finds.Separately Outward, this method establishes a kind of iterative computation algorithm and carries out neologisms and automatically extract step, and is different from traditional method and directly carries out Calculate.First, this point can be effectively reduced the Time & Space Complexity of the method.Original neologisms based on EMI detection In algorithm, in order to once find all of neologisms, need to find the n tuple of high-order, i.e. 
find the company being arbitrarily not more than n in text The combination of continuous segmentation word section.However as the increase of n, the quantity of candidate word presents exponential increase, to internal memory and the consumption of time Also it is increased dramatically.And the method uses the mode of iteration, can only use two tuples and tlv triple in each iterative process, it Finding longer word combination by union operation repeatedly afterwards, therefore the method can effectively reduce the demand to internal memory, tool There is relatively low Time & Space Complexity.On the other hand, in the step of iteration each time, the candidate recognized can be produced new Word, before the method utilizes these candidate's neologisms to optimize participle next time and operates, and then the word segmentation result after utilization optimization filters Underproof item in the new set of words found, this point can promote the accuracy rate of the neologisms that the method finds further.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the invention is defined by the claims and their equivalents.

Claims (5)

1. A method for automatically extracting network neologisms based on massive microblog texts and user information, characterized by comprising the following steps:
S1: obtaining microblog data, wherein the microblog data comprises microblog texts and the author identifier corresponding to each microblog;
S2: establishing a neologism list, wherein the neologism list is initialized as an empty set;
S3: adding the neologism list to a preset Chinese analysis tool, performing a word segmentation operation on the microblog texts with the Chinese analysis tool so as to map the microblog texts to sets of segmented word segments, and counting, for each segmented word segment, word frequency information in the two dimensions of text and user;
S4: updating the word frequency information of the corresponding words in the neologism list according to the obtained word frequency information, and deleting from the neologism list any word whose word frequency is less than a first frequency threshold;
S5: defining n segmented word segments that occur consecutively in the word segmentation operation as an n-tuple, counting all two-tuples and triples in the microblog data, and taking the two-tuples and triples as candidate neologisms;
S6: according to the distribution of the candidate neologisms in the two dimensions of text and user, counting the word frequency information of the candidate neologisms in the two dimensions of text and user, and calculating a relatedness score for the candidate neologisms;
S7: adding to the neologism list those candidate neologisms whose word frequency is greater than a second frequency threshold and whose relatedness score is greater than a score threshold; and
S8: iteratively performing S2 to S7 until no new candidate neologisms are produced from the microblog data and no candidate neologisms are deleted from the neologism list.
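The steps of claim 1 can be sketched in Python as follows. This is a toy illustration under several assumptions, not the patented implementation:

- `segment` is a hypothetical forward-maximum-matching segmenter that stands in for the preset Chinese analysis tool.
- Text frequency is taken as the number of microblogs containing an item, and user frequency as the number of distinct authors using it.
- The relatedness score is the sum of a text-based and a user-based EMI term.
- The threshold values, `max_rounds`, and the sample posts are invented for the example.
- The neologism list is initialized once and carried across rounds, which is one reading of S8.

```python
import math
from collections import Counter


def segment(text, lexicon):
    """Toy forward-maximum-matching segmenter; unknown characters stay single."""
    max_len = max((len(w) for w in lexicon), default=1)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                out.append(piece)
                i += size
                break
    return out


def count(posts, lexicon, sizes=(1, 2, 3)):
    """Per-post (text) and per-author (user) frequencies of 1-, 2- and 3-tuples."""
    text_f, authors = Counter(), {}
    for user, text in posts:
        segs = segment(text, lexicon)
        seen = {tuple(segs[i:i + n]) for n in sizes for i in range(len(segs) - n + 1)}
        for gram in seen:
            text_f[gram] += 1
            authors.setdefault(gram, set()).add(user)
    return text_f, {g: len(u) for g, u in authors.items()}


def emi(gram, freq, total):
    """log[(N_w/T) / prod_i((N_wi - N_w)/T)]; +inf (an assumption) if a part never occurs outside the tuple."""
    joint, denom = freq[gram], 1.0
    for part in gram:
        denom *= (freq[(part,)] - joint) / total
    return math.inf if denom <= 0 else math.log((joint / total) / denom)


def extract(posts, min_keep=2, min_new=2, min_score=0.0, max_rounds=10):
    new_words = {}                                    # S2: empty neologism list
    n_posts, n_users = len(posts), len({u for u, _ in posts})
    for _ in range(max_rounds):                       # S8: iterate
        text_f, user_f = count(posts, set(new_words))  # S3 and S5
        # S4: refresh frequencies and drop words below the first threshold.
        kept = {w: text_f.get((w,), 0) for w in new_words}
        kept = {w: f for w, f in kept.items() if f >= min_keep}
        removed, added = len(new_words) - len(kept), 0
        for gram, f in text_f.items():                # S6 and S7
            if len(gram) == 1 or f <= min_new:
                continue
            score = emi(gram, text_f, n_posts) + emi(gram, user_f, n_users)
            word = "".join(gram)
            if score > min_score and word not in kept:
                kept[word] = f
                added += 1
        new_words = kept
        if not added and not removed:
            break
    return new_words


# Invented sample data: (author identifier, microblog text), as in S1.
posts = [("u1", "今天真给力"), ("u2", "这个太给力了"), ("u3", "给力的一天"),
         ("u4", "天气不错"), ("u5", "给我力量")]
print(extract(posts))
```

On this sample, the pair formed by 给 and 力 is the only candidate that clears the frequency threshold, so the second round finds nothing further to add or delete and the loop stops.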
2. The method for automatically extracting network neologisms based on massive microblog texts and user information according to claim 1, characterized in that, in S4, neologisms are extracted automatically according to the user information of the microblog data on the basis of the microblog texts.
3. The method for automatically extracting network neologisms based on massive microblog texts and user information according to claim 1, characterized in that, in S8, the segmented word segments in the word segmentation result are merged iteratively by an iterative algorithm, wherein each iteration only needs to search for two-tuples and triples in the microblog data.
4. The method for automatically extracting network neologisms based on massive microblog texts and user information according to claim 3, characterized in that S8 further comprises:
after each iteration is completed, adding the neologisms found to the neologism list, and using the neologism list as the user-defined dictionary of the preset Chinese analysis tool, so that in the next word segmentation operation the neologisms found in the previous iteration are segmented correctly.
5. The method for automatically extracting network neologisms based on massive microblog texts and user information according to claim 1, characterized in that S6 further comprises:
calculating, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency, specifically:
$$\mathrm{EMI}(w_n) = \log \frac{N_{w_n} / T}{\prod_{i=1}^{n} \left[ \left( N_{w_i^n} - N_{w_n} \right) / T \right]},$$
wherein $N_{w_n}$ and $N_{w_i^n}$ denote the frequencies, based on microblog text, of the word $w_n$ and of its i-th component $w_i^n$, respectively, $T$ is the total number of microblogs, and $n$ is the parameter n of the n-tuple, n = 2 or 3;
calculating the EMI score between users according to the distribution of the word among users, specifically:
$$\mathrm{usrEMI}(w_n) = \log \frac{N^{u}_{w_n} / T_u}{\prod_{i=1}^{n} \left[ \left( N^{u}_{w_i^n} - N^{u}_{w_n} \right) / T_u \right]},$$
wherein $N^{u}_{w_n}$ and $N^{u}_{w_i^n}$ denote the frequencies, based on user usage, of the word $w_n$ and of its i-th component $w_i^n$, respectively, $T_u$ is the total number of users, and $n$ is the parameter n of the n-tuple, n = 2 or 3;
obtaining the relatedness score of a candidate neologism from the EMI score based on text frequency and the EMI score between users, specifically:
$$\mathrm{ascore}(w_n) = \mathrm{EMI}(w_n) + \mathrm{usrEMI}(w_n),$$
wherein $\mathrm{ascore}(w_n)$ is the relatedness score of the candidate neologism $w_n$.
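As a worked illustration of the three formulas in claim 5, the following Python sketch evaluates them on invented counts. The function names, the use of the natural logarithm, and the choice to return positive infinity when a component never occurs outside the candidate (which makes the denominator zero) are assumptions of this sketch; the claim does not specify the logarithm base or that edge case.

```python
import math


def emi(n_word, n_parts, total):
    """EMI = log[(N_w / T) / prod_i((N_wi - N_w) / T)], natural log assumed.

    n_word  -- frequency of the candidate n-tuple
    n_parts -- frequencies of its n component segments
    total   -- T (number of microblogs) or T_u (number of users)
    """
    denom = math.prod((n_i - n_word) / total for n_i in n_parts)
    if denom <= 0:
        # Assumption: a component that never appears on its own is treated
        # as maximal association instead of raising a division error.
        return math.inf
    return math.log((n_word / total) / denom)


def ascore(text_counts, user_counts):
    """Relatedness score: text-based EMI plus user-based EMI."""
    return emi(*text_counts) + emi(*user_counts)


# Invented counts for a candidate two-tuple.
text_counts = (50, [60, 80], 1000)  # in 50 of 1000 posts; parts in 60 and 80 posts
user_counts = (20, [40, 30], 500)   # used by 20 of 500 users; parts by 40 and 30 users

print(emi(*text_counts))            # log((50/1000) / ((10/1000) * (30/1000)))
print(emi(*user_counts))            # log((20/500) / ((20/500) * (10/500)))
print(ascore(text_counts, user_counts))
```

Subtracting the candidate's own frequency from each component's frequency means the score rises when the components rarely occur apart from one another, in both the text and the user dimension.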
CN201610324541.9A 2016-05-17 2016-05-17 Method for automatically extracting network neologisms based on massive microblog texts and user information Active CN105956158B (en)

Publications (2)

Publication Number Publication Date
CN105956158A true CN105956158A (en) 2016-09-21
CN105956158B CN105956158B (en) 2019-08-09

Family

ID=56912577

Country Status (1)

Country Link
CN (1) CN105956158B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005135311A (en) * 2003-10-31 2005-05-26 Nippon Telegr & Teleph Corp <Ntt> Category-classified new feature word ranking method, apparatus and program, and computer readable storage medium recorded with category-classified new feature word ranking program
CN1924858A (en) * 2006-08-09 2007-03-07 北京搜狗科技发展有限公司 Method and device for fetching new words and input method system
CN103678656A (en) * 2013-12-23 2014-03-26 合肥工业大学 Unsupervised automatic extraction method of microblog new words based on repeated word strings
US20140288924A1 (en) * 2008-06-06 2014-09-25 Zi Corporation Of Canada, Inc. Systems and methods for an automated personalized dictionary generator for portable devices

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PENG F, FENG F, MCCALLUM A.: "Chinese segmentation and new word detection using conditional random fields", 《PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS. ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
李钝: "Internet中的新词识别", 《北京邮电大学学报》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528523A (en) * 2016-09-22 2017-03-22 中山大学 Network neologism identification method
CN106528523B (en) * 2016-09-22 2019-05-10 中山大学 A kind of network new word identification method
CN107992501A (en) * 2016-10-27 2018-05-04 腾讯科技(深圳)有限公司 Social network information recognition methods, processing method and processing device
CN107992501B (en) * 2016-10-27 2021-12-14 腾讯科技(深圳)有限公司 Social network information identification method, processing method and device
WO2019085335A1 (en) * 2017-11-01 2019-05-09 平安科技(深圳)有限公司 Method for discovering investment objects with new words, device and storage medium
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 A kind of Chinese new word discovery method based on novel degree
CN108509425B (en) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
WO2021027085A1 (en) * 2019-08-15 2021-02-18 苏州朗动网络科技有限公司 Method and device for automatically extracting text keyword, and storage medium

Also Published As

Publication number Publication date
CN105956158B (en) 2019-08-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant