CN105956158A - Automatic extraction method of network neologism on the basis of mass microblog texts and use information
- Publication number
- CN105956158A (application number CN201610324541.9A)
- Authority
- CN
- China
- Prior art keywords
- neologisms
- word
- text
- list
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for automatically extracting network neologisms based on massive microblog text and user information. The method comprises: obtaining microblog text and the author identifier corresponding to each microblog; establishing a neologism list; segmenting the microblog text with a Chinese analysis tool to obtain segments, and counting the frequency of each segment separately along two dimensions, text and user; deleting from the neologism list any word whose frequency is below a first frequency threshold; counting all 2-tuples and 3-tuples in the microblog data and taking them as candidate neologisms; calculating an association score for each candidate neologism; adding to the neologism list every candidate whose frequency exceeds a second frequency threshold and whose association score exceeds a score threshold; and iterating the above process until no new neologisms are produced and no words are deleted from the neologism list. The method extracts network neologisms automatically, with high accuracy and low time and space complexity.
Description
Technical field
The present invention relates to the technical field of network data mining, and in particular to a method for automatically extracting network neologisms based on massive microblog text and user information.
Background art
New word discovery is an important part of research on Chinese natural language processing. Neologisms are words that do not exist in traditional dictionaries. On the Internet, and particularly in social networks, neologisms emerge constantly. Social network users frequently use network neologisms in order to express strong feelings, to convey personal emotional coloring, or to make the text they post more interesting and lively. These neologisms may be abbreviations of longer words or sentences, may be homophones of traditional words, or may even be entirely unrelated to traditional words. Social networks are now one of the important components of the Internet, and the analysis of social media data is a hot research field for experts and scholars in many areas of data mining. On the one hand, social media data is updated very quickly, so the amount of data available for study is abundant. On the other hand, because social network users are very active, they are more inclined to use novel expressions that depart from the grammatical rules of traditional text, which leads to the emergence of large numbers of neologisms in social networks and poses a great challenge to traditional text analysis techniques.
Unlike languages such as English, in which spaces naturally separate words, Chinese text consists of sequences of Chinese characters, and the unit that carries the semantics of Chinese text is usually the word rather than the single character. Each Chinese word has its own specific meaning and part of speech. The first step of most Chinese natural language processing tasks is therefore to divide the Chinese text into "segments" made up of different words, a step called "word segmentation". Word segmentation depends heavily on the dictionary used by the segmentation tool. According to statistics, more than 60% of segmentation errors are caused by the failure to segment neologisms correctly: because neologisms are not in the dictionary of the segmentation tool, the tool cannot recognize them correctly.
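The dictionary dependence described above can be illustrated with a minimal sketch. It uses forward maximum matching, a simple dictionary-based segmentation technique chosen here only for illustration; the source does not say which algorithm its segmentation tool uses. The function name, the toy dictionary, and the example sentence are all assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (illustrative stand-in for a real tool).

    At each position the longest dictionary entry is taken; if nothing
    matches, a single character is emitted as its own segment.
    """
    segments = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                segments.append(piece)
                i += size
                break
    return segments


# Toy dictionary of traditional words (hypothetical).
base_dict = {"今天", "天气"}

# "给力" is absent from the dictionary, so it is broken into single characters.
print(forward_max_match("今天天气给力", base_dict))

# Once the word is added to the dictionary it is kept whole.
print(forward_max_match("今天天气给力", base_dict | {"给力"}))
```

The second call shows why feeding discovered words back into the dictionary changes the segmentation result.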
Traditional neologism detection methods mainly fall into the following categories: embedding neologism detection in the segmentation task; methods based on complex linguistic rules and knowledge; converting word detection into a classification problem; and statistical methods. Traditional methods generally struggle to reach high accuracy, and because they often generate a large number of candidate neologisms, their time or space complexity is high.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, an object of the present invention is to propose a method for automatically extracting network neologisms based on massive microblog text and user information. The method considers user information in addition to microblog text and establishes an iterative algorithm to extract network neologisms automatically; its results have higher accuracy and lower time and space complexity.
To achieve the above object, an embodiment of the present invention discloses a method for automatically extracting network neologisms based on massive microblog text and user information, comprising the following steps:
S1: obtaining microblog data, wherein the microblog data includes microblog text and the author identifier corresponding to each microblog;
S2: establishing a neologism list, wherein the neologism list is initialized as the empty set;
S3: adding the neologism list to a preset Chinese analysis tool, segmenting the microblog text with the Chinese analysis tool so as to map the microblog text to a set of segments, and counting the frequency information of each segment separately along the text dimension and the user dimension;
S4: updating the frequency information of the corresponding words in the neologism list according to the frequency information obtained, and deleting from the neologism list any word whose frequency is below a first frequency threshold;
S5: defining n segments that occur consecutively in the segmentation as an n-tuple, counting all 2-tuples and 3-tuples in the microblog data, and taking the 2-tuples and 3-tuples as candidate neologisms;
S6: counting, according to the distribution of the candidate neologisms along the text and user dimensions, the frequency information of the candidate neologisms along both dimensions, and calculating the association score of each candidate neologism;
S7: adding to the neologism list every candidate neologism whose frequency exceeds a second frequency threshold and whose association score exceeds a score threshold; and
S8: iterating S2 to S7 until no new candidate neologisms are produced in the microblog data and no candidate neologisms are deleted from the neologism list.
The method for automatically extracting network neologisms based on massive microblog text and user information according to the embodiment of the present invention exploits the characteristics of microblog data, considers user information in addition to microblog text, and establishes an iterative algorithm to extract network neologisms automatically. Compared with traditional methods, its results have higher accuracy and lower time and space complexity, and it has important applications in the mining and analysis of social media data.
In addition, the method for automatically extracting network neologisms based on massive microblog text and user information according to the above embodiment may also have the following additional technical features:
In some examples, in S4, neologisms are extracted automatically according to the user information of the microblog data in addition to the microblog text.
In some examples, in S8, an iterative algorithm merges the segments in the segmentation result round by round, and each iteration only needs to search for 2-tuples and 3-tuples in the microblog data.
In some examples, S8 further includes: after each iteration completes, adding the neologisms found to the neologism list, and using the list as the user-defined dictionary of the preset Chinese analysis tool, so that the next segmentation pass correctly segments the neologisms found in the previous iteration.
In some examples, S6 further includes: calculating, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency, specifically:
[formula not reproduced in the source text]
where the two frequency terms are the microblog-text-based frequencies of the word w^n and of a second word whose symbol is missing from the source text, T is the total number of microblogs, and n is the order of the n-tuple, n = 2 or 3;
calculating the EMI score across users according to the distribution of the word among users, specifically:
[formula not reproduced in the source text]
where the two frequency terms are the user-usage-based frequencies of the word w^n and of a second word whose symbol is missing from the source text, T_u is the total number of users, and n is the order of the n-tuple, n = 2 or 3;
obtaining the association score of the candidate neologism from the EMI score based on text frequency and the EMI score across users, specifically:
ascore(w^n) = EMI(w^n) + usrEMI(w^n),
where ascore(w^n) is the association score of the candidate neologism w^n.
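The two EMI formulas were images in the original and did not survive into this text. For orientation only, one form of enhanced mutual information that appears in the multiword-expression literature can be written as below, using the symbols T, T_u and n defined above. The frequency symbols F (text-based) and U (user-based) and the component index w_i are assumptions introduced here, and the patent's exact expressions may differ.

```latex
\mathrm{EMI}(w^n) = \log_2 \frac{F_{w^n}/T}{\prod_{i=1}^{n} \left(F_{w_i} - F_{w^n}\right)/T},
\qquad
\mathrm{usrEMI}(w^n) = \log_2 \frac{U_{w^n}/T_u}{\prod_{i=1}^{n} \left(U_{w_i} - U_{w^n}\right)/T_u}
```

Under this assumed form, the score rises when the components w_i seldom occur outside w^n, which is consistent with reading a high score as a sign of strongly associated segments.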
In some examples, in S1, the microblog data is obtained with web crawler technology.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the method for automatically extracting network neologisms based on massive microblog text and user information according to an embodiment of the present invention; and
Fig. 2 is an overall flowchart of the method for automatically extracting network neologisms based on massive microblog text and user information according to one embodiment of the present invention.
Detailed description of the invention
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it.
In the description of the present invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "lateral", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings, are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the present invention. In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected to" and "connected" are to be understood broadly: a connection may, for example, be fixed, detachable, or integral; mechanical or electrical; direct or indirect through an intermediary; or an internal connection between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific situation.
The method for automatically extracting network neologisms based on massive microblog text and user information according to an embodiment of the present invention is described below with reference to the drawings.
Fig. 1 is a flowchart of the method for automatically extracting network neologisms based on massive microblog text and user information according to an embodiment of the present invention. Fig. 2 is an overall flowchart of the method according to one embodiment of the present invention. With reference to Fig. 1 and Fig. 2, the method for automatically extracting network neologisms based on massive microblog text and user information according to the embodiment of the present invention comprises the following steps:
Step S1: obtain microblog data, where the microblog data includes the text content of each microblog and the author identifier corresponding to it. In some examples, a large amount of microblog data is obtained with web crawler technology. For example, the obtained microblog data set is denoted D, each item of which includes a microblog text D_i and the corresponding user identifier S_i.
Step S2: establish a neologism list, denoted for example W, which is initialized as the empty set.
Step S3: add the neologism list to a preset Chinese analysis tool as a user-defined dictionary, and use the tool to segment each microblog text D_i in the data set D, mapping D_i to a set of segments; then count the frequency of each segment separately along the text dimension and the user dimension. In other words, each segment obtained is treated as a basic unit, denoted w_i, and for each w_i two frequencies are counted and recorded separately, one based on microblog text and one based on user information.
Step S4: update the frequency information of the corresponding words in the neologism list according to the frequency information obtained in step S3, and delete from the neologism list W any word whose frequency is below the first frequency threshold.
In step S4, the user information of the microblog data is exploited in addition to the microblog text to extract neologisms automatically. Whereas traditional methods consider only the distribution of neologisms over text content, this method, in keeping with the characteristics of microblog data, considers the distribution of network neologisms along both the text dimension and the user dimension.
Step S5: among the segments produced by segmentation, treat each individual segment as a basic unit and define n consecutively occurring segments as an n-tuple; on this basis, count all 2-tuples and 3-tuples that occur in the microblog data and take them as candidate neologisms.
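The enumeration of 2-tuples and 3-tuples in step S5 can be sketched as follows. This is a minimal illustration that assumes each microblog has already been segmented into a list of strings; the function names and the placeholder segments are hypothetical.

```python
def ngrams(segments, n):
    """Return every run of n consecutive segments as a tuple."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]


def candidate_tuples(segments, orders=(2, 3)):
    """Collect the distinct 2-tuples and 3-tuples of one segmented microblog."""
    found = set()
    for n in orders:
        found.update(ngrams(segments, n))
    return found


segments = ["w1", "w2", "w3", "w4"]   # placeholder segments
print(ngrams(segments, 2))
print(sorted(candidate_tuples(segments)))
```

A sequence shorter than n simply yields no n-tuples, so very short microblogs need no special handling.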
Step S6: according to the distribution of the candidate neologisms along the text and user dimensions, count the frequency of each candidate neologism along both dimensions, and calculate its association score.
In some examples, this step specifically includes:
First, based on enhanced mutual information (EMI) theory, calculate the EMI score of each word based on text frequency, specifically:
[formula not reproduced in the source text]
where the two frequency terms are the microblog-text-based frequencies of the word w^n and of a second word whose symbol is missing from the source text, T is the total number of microblogs, and n is the order of the n-tuple, n = 2 or 3;
Second, calculate the EMI score across users according to the distribution of the word among users, specifically:
[formula not reproduced in the source text]
where the two frequency terms are the user-usage-based frequencies of the word w^n and of a second word whose symbol is missing from the source text, T_u is the total number of users, and n is the order of the n-tuple, n = 2 or 3;
Finally, obtain the association score of the candidate neologism from the EMI score based on text frequency and the EMI score across users, specifically:
ascore(w^n) = EMI(w^n) + usrEMI(w^n),
where ascore(w^n) is the association score of the candidate neologism w^n.
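The scoring in step S6 might be sketched as below. Only the sum ascore = EMI + usrEMI is taken from the text; the body of `emi` is an assumed form of enhanced mutual information (the tuple's relative frequency divided by the product of each component's relative frequency outside the tuple), and the floor of 1 that keeps the logarithm finite is a further assumption. All function and parameter names are hypothetical.

```python
import math


def emi(freq_tuple, freq_parts, total):
    """EMI-style score under an assumed formula (not confirmed by the source).

    freq_tuple: frequency of the candidate n-tuple (must be positive)
    freq_parts: frequencies of its n component segments
    total:      total number of microblogs (or of users)
    """
    numerator = freq_tuple / total
    denominator = 1.0
    for freq_part in freq_parts:
        # Occurrences of the component outside the tuple, floored at 1
        # so that the ratio stays finite (smoothing is an assumption).
        denominator *= max(freq_part - freq_tuple, 1) / total
    return math.log2(numerator / denominator)


def association_score(text_freq, text_part_freqs, total_posts,
                      user_freq, user_part_freqs, total_users):
    """ascore(w^n) = EMI(w^n) + usrEMI(w^n), as stated in the text."""
    return (emi(text_freq, text_part_freqs, total_posts)
            + emi(user_freq, user_part_freqs, total_users))


# Hypothetical counts: a 2-tuple seen in 10 of 100 posts and used by 5 of 50 users.
print(association_score(10, [20, 30], 100, 5, [10, 10], 50))
```

Under this assumed form, a tuple whose components rarely appear apart scores higher than one whose components are common on their own.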
Step S7: add to the neologism list every candidate neologism whose frequency exceeds the second frequency threshold and whose association score exceeds the score threshold.
Step S8: iterate steps S2 to S7 until no new candidate neologisms are produced in the microblog data and no candidate neologisms are deleted from the neologism list. That is, this step establishes an iterative algorithm that merges the segments in the segmentation result round by round. Each iteration therefore only needs to find low-order n-tuples (for example 2-tuples and 3-tuples), whereas traditional methods must find high-order n-tuples in order to discover long neologisms, so that the number of candidate words grows exponentially as n increases. Compared with traditional methods, the method of this embodiment thus replaces direct computation with iterative computation and significantly reduces the space and time complexity.
Further, in step S8, after each iteration completes, the neologisms found are added to the neologism list, and the list is used as the user-defined dictionary of the preset Chinese analysis tool, so that the next segmentation pass correctly segments the neologisms found in the previous iteration. The segmentation result is thereby continually optimized, and the optimized segmentation in turn improves the quality of the candidate neologisms.
In summary, the method for automatically extracting network neologisms based on massive microblog text and user information according to the embodiment of the present invention exploits the characteristics of microblog data, considers user information in addition to microblog text, and establishes an iterative algorithm to extract network neologisms automatically. Compared with traditional methods, its results have higher accuracy and lower time and space complexity, and it has important applications in the mining and analysis of social media data.
To aid understanding of the invention, the method of the above embodiment is described in further detail below with reference to a specific embodiment, taking Sina Weibo data as an example.
In this embodiment, the method comprises, for example, the following steps:
Step 1: use the API (application programming interface) provided by Sina Weibo to crawl a large amount of microblog data posted by Sina Weibo users, including all microblog texts and the identifiers of the corresponding users (a user name, a user ID, or any other attribute that uniquely identifies a user). The data set is denoted D, each item of which includes the microblog text content D_i and the user identifier S_i, i.e. D = {(D_i, S_i) | i = 1, 2, 3, ...}. This prepares the data for the subsequent neologism detection task.
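The data set D = {(D_i, S_i)} could be held in memory as sketched below. The `Microblog` type, the `build_dataset` helper, and the choice to skip empty texts are assumptions made for illustration; the source does not describe a storage format, and no Sina Weibo API call is shown.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Microblog(NamedTuple):
    text: str      # D_i: text content of the microblog
    user_id: str   # S_i: identifier of the author


def build_dataset(rows: Iterable[Tuple[str, str]]) -> List[Microblog]:
    """Build D from raw (text, user) rows, skipping empty texts (an assumption)."""
    return [Microblog(text.strip(), str(user))
            for text, user in rows if text.strip()]


# Hypothetical rows, as a crawler might deliver them.
D = build_dataset([("今天天气给力", "u1"), ("  ", "u2"), ("给力", "u2")])
print(len(D), D[0].user_id)
```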
Step 2: define a neologism list W and initialize it as the empty set. The method updates W continually, and the final content of W is the output of the whole method, i.e. the extracted network neologisms.
Step 3: use the neologism list W as the user-defined dictionary of the Chinese analysis tool, and use the tool to segment each text content D_i in the microblog data set D, obtaining the set of segments of each microblog, denoted w, w = {w_i | i = 1, 2, 3, ...}, where w_i is an individual segment. Because W is updated continually, newly found network neologisms keep improving the segmentation result.
Step 4: from the segments in the segmentation result of step 3, count the frequency information of each segment. For each segment w_i, count its frequency separately along the microblog text dimension and the user dimension. Specifically, for each segment w_i, in the microblog text dimension count how many microblogs contain w_i, and in the user usage dimension count how many users have used w_i, recording each result. This step thus maps microblog text content and user usage information to a segment frequency table, which is denoted G.
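The frequency table G of step 4 can be sketched as a dictionary from each segment to a pair of counts: the number of microblogs that contain it and the number of distinct users who used it. The function name and the input layout (a list of segment lists paired with user identifiers) are assumptions.

```python
from collections import defaultdict


def build_frequency_table(segmented_posts):
    """Map each segment to (posts containing it, distinct users using it).

    segmented_posts: iterable of (segments, user_id) pairs.
    """
    text_freq = defaultdict(int)
    users = defaultdict(set)
    for segments, user_id in segmented_posts:
        for segment in set(segments):      # count each post at most once
            text_freq[segment] += 1
            users[segment].add(user_id)
    return {segment: (text_freq[segment], len(users[segment]))
            for segment in text_freq}


posts = [(["a", "b", "a"], "u1"), (["a"], "u1"), (["b"], "u2")]
G = build_frequency_table(posts)
print(G["a"], G["b"])
```

Here "a" appears in two posts by a single user, while "b" appears in two posts by two users, so the two dimensions can diverge for the same segment.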
Step 5: after the frequency information of each segment has been counted, use it to update the neologism list W. For each word in W, according to the segment frequency table G, remove from W any word whose microblog text frequency is below the first frequency threshold. The words in W are re-counted and filtered after every segmentation pass because adding W to the segmentation as the user-defined dictionary of the Chinese analysis tool lets the tool find a more suitable segmentation of the microblog text based on the words in W, which optimizes the segmentation result; the optimized result can in turn be used to filter out erroneous neologisms found earlier.
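The filtering in step 5 amounts to a threshold test against G. The sketch below assumes G maps each word to a pair (text frequency, user frequency) and that a word missing from G has frequency zero; the names and the sample values are hypothetical.

```python
def prune_new_words(new_words, freq_table, first_threshold):
    """Keep only the words whose text frequency reaches the first threshold."""
    return {word for word in new_words
            if freq_table.get(word, (0, 0))[0] >= first_threshold}


G = {"给力": (12, 9), "天给": (1, 1)}     # hypothetical frequency table
W = {"给力", "天给", "力今"}              # "力今" no longer appears after re-segmentation
print(prune_new_words(W, G, first_threshold=3))
```

A word that the improved segmentation stops producing drops to zero frequency and is removed, which is how earlier mistakes get corrected.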
Step 6: use the segments to find all 2-tuples and 3-tuples in the massive microblog text. A 2-tuple consists of two adjacent segments w_i in the microblog text content and is denoted w^2 = w_1 w_2; a 3-tuple consists of three adjacent segments and is denoted w^3 = w_1 w_2 w_3. 2-tuples and 3-tuples are denoted uniformly as w^n, and together they constitute the candidate neologisms.
Step 7: for each candidate neologism w^n, use the same method as for the segment frequency table above to count its frequency along two dimensions, the text content it appears in and the corresponding user usage information. The result is denoted F, the set containing all possible candidate neologisms. Next, a statistical-learning-based method decides, from each word's text frequency and user usage, whether each word in F is a neologism. First, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency is calculated by the following formula:
[formula not reproduced in the source text]
where the two frequency terms are the microblog-text-based frequencies of the word w^n and of a second word whose symbol is missing from the source text, T is the total number of microblogs, and n is the order of the n-tuple (n = 2 or 3). The higher the EMI score of a word w^n, the more strongly the segments that compose it are associated, and the more likely w^n is a network neologism.
Then, the EMI score across users is calculated from the distribution of the word among users by the following formula:
[formula not reproduced in the source text]
where the two frequency terms are the user-usage-based frequencies of the word w^n and of a second word whose symbol is missing from the source text, T_u is the total number of users, and n is the order of the n-tuple (n = 2 or 3). The higher the user EMI score of w^n, the more users are likely to have used it and the more strongly it is associated across different users, so the more likely w^n is a popular network neologism.
Finally, the association score ascore of a candidate neologism w^n is defined as:
ascore(w^n) = EMI(w^n) + usrEMI(w^n),
where, for a candidate neologism w^n, a higher association score indicates that the segments composing the word are more strongly associated along both the microblog text and user usage dimensions. Since w^n has also not been detected correctly by the segmentation tool, it is likely a popular user-coined word found in the microblogs, i.e. a network neologism.
According to prior knowledge, a "neologism" is an emerging word that is widely accepted, has a definite meaning, and does not exist in traditional dictionaries, so a neologism must be widely used by many different users. The frequency information of a word and its association score reflect these criteria well. Therefore, if the association score of a word w^n exceeds the association score threshold and its frequency also exceeds the frequency threshold, w^n is added to the neologism list W.
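The acceptance rule above combines two strict inequalities. The sketch below assumes each candidate comes with a single frequency value and a precomputed association score; the text does not say whether the text-based or the user-based frequency is compared with the threshold, so that choice is left to the caller. All names and values are hypothetical.

```python
def select_new_words(candidates, second_threshold, score_threshold):
    """Return the candidates that pass both the frequency and the score test.

    candidates: dict mapping a candidate word to (frequency, association score).
    """
    return {word for word, (freq, score) in candidates.items()
            if freq > second_threshold and score > score_threshold}


candidates = {
    "给力": (40, 6.5),   # frequent and strongly associated
    "天给": (35, 0.8),   # frequent but weakly associated
    "力今": (2, 7.1),    # strongly associated but rare
}
print(select_new_words(candidates, second_threshold=10, score_threshold=3.0))
```

Requiring both conditions screens out accidental adjacencies of common segments as well as rare combinations used by very few people.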
The seven steps above make up one iteration of the method. These steps are repeated until an iteration neither adds a new word to the neologism list W nor deletes a word from it; the iteration then ends, and each item in W at that point is a network neologism extracted by the method of the invention.
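The stopping rule and the way repeated merging builds longer words can be seen in a deliberately simplified sketch. It is not the patented procedure: the Chinese analysis tool is replaced by a toy longest-match segmenter, the association score is replaced by a plain count of posts, only 2-tuples are merged, and user information is ignored. Letters stand in for Chinese characters, and every name is hypothetical.

```python
from collections import Counter


def segment(text, user_dict, max_len=6):
    """Toy segmenter: longest match against user_dict, else one character."""
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in user_dict:
                segments.append(piece)
                i += size
                break
    return segments


def frequent_pairs(segmented_posts, min_posts):
    """Merged 2-tuples that occur in at least min_posts posts."""
    counts = Counter()
    for segments in segmented_posts:
        counts.update({a + b for a, b in zip(segments, segments[1:])})
    return {word for word, count in counts.items() if count >= min_posts}


def extract_new_words(texts, min_posts=2, max_rounds=20):
    new_words = set()                          # W starts as the empty set
    for _ in range(max_rounds):
        segmented = [segment(text, new_words) for text in texts]
        used = {s for segments in segmented for s in segments}
        kept = new_words & used                # drop words no longer produced
        added = frequent_pairs(segmented, min_posts) - kept
        if kept == new_words and not added:
            break                              # nothing added, nothing deleted
        new_words = kept | added
    return new_words


print(extract_new_words(["abcx", "yabc", "abcz"]))
```

On this toy input the three-character word "abc" is reached in two merging rounds, passing through the intermediate "ab", which is discarded once the longer word takes its place in the segmentation.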
In summary, the method of this embodiment has the following characteristics. It makes use of word distribution information along the microblog user dimension. Compared with traditional methods, it uses a statistical approach based on enhanced mutual information (EMI) theory to analyze not only the distribution of neologisms over text content but also, exploiting the nature of microblogs as a carrier of network information, the distribution of their use across different users, which can noticeably improve the accuracy of the neologisms found. In addition, the method establishes an iterative algorithm for automatic neologism extraction instead of computing directly as traditional methods do. First, this effectively reduces the time and space complexity of the method. The original EMI-based neologism detection algorithm, in order to find all neologisms in one pass, must find high-order n-tuples, i.e. every combination of no more than n consecutive segments in the text. As n increases, however, the number of candidate words grows exponentially, and memory and time consumption rise sharply. This method instead iterates: each iteration uses only 2-tuples and 3-tuples, and longer word combinations are found through repeated merging, so the method effectively reduces memory demand and has low time and space complexity. Second, each iteration produces recognized candidate neologisms, which the method uses to optimize the next segmentation pass; the optimized segmentation result is then used to filter unqualified items out of the set of neologisms found, which further improves accuracy.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples" and the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents.
Claims (5)
1. A method for automatically extracting network neologisms based on massive microblog text and user information, characterized by comprising the following steps:
S1: obtaining microblog data, wherein the microblog data includes microblog text and the author identifier corresponding to each microblog;
S2: establishing a neologism list, wherein the neologism list is initialized as an empty set;
S3: adding the neologism list to a preset Chinese analysis tool, and performing a word segmentation operation on the microblog text with the Chinese analysis tool, so as to map the microblog text to a set of segmented word segments, and counting, for each segmented word segment, its word frequency information in the two dimensions of text and user;
S4: updating the word frequency information of the corresponding words in the neologism list according to the obtained word frequency information, and deleting from the neologism list the words whose word frequency is less than a first frequency threshold;
S5: defining n segmented word segments that occur consecutively in the word segmentation operation as an n-tuple, counting all 2-tuples and 3-tuples in the microblog data, and taking the 2-tuples and 3-tuples as candidate neologisms;
S6: according to the distribution of the candidate neologisms in the two dimensions of text and user, counting the word frequency information of the candidate neologisms in the two dimensions of text and user, and calculating the relatedness score of the candidate neologisms;
S7: adding to the neologism list those candidate neologisms whose word frequency is greater than a second frequency threshold and whose relatedness score is greater than a score threshold; and
S8: iteratively performing S2 to S7 until no new candidate neologisms are produced from the microblog data and no candidate neologisms are deleted from the neologism list.
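Steps S1 to S8 can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the claimed implementation: `segment` is a toy longest-match segmenter standing in for the preset Chinese analysis tool, frequencies are counted as the number of posts containing a token, the user-dimension filter `min_users` and the placeholder `score` function are hypothetical, and the list is initialized once before the loop rather than on every pass.

```python
from collections import Counter


def segment(text, lexicon):
    """Toy stand-in for the preset Chinese analysis tool (an assumption):
    greedy longest match against `lexicon`, else single characters."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_neologisms(posts, min_keep=2, min_add=2, min_users=2,
                       score=lambda word: 1.0, min_score=0.0, max_rounds=10):
    """posts: list of (author_id, text) pairs (S1). Returns the neologism list as a set."""
    neologisms = set()                                   # S2: start from an empty list
    for _ in range(max_rounds):                          # S8: iterate until stable
        segmented = [(user, segment(text, neologisms))   # S3: segment with the list
                     for user, text in posts]
        token_posts = Counter(tok for _, toks in segmented for tok in set(toks))
        kept = {w for w in neologisms                    # S4: drop rare list entries
                if token_posts[w] >= min_keep}
        gram_posts, gram_users = Counter(), {}
        for user, toks in segmented:                     # S5: 2-tuples and 3-tuples
            grams = {"".join(toks[i:i + n])
                     for n in (2, 3) for i in range(len(toks) - n + 1)}
            for gram in grams:                           # S6: text and user counts
                gram_posts[gram] += 1
                gram_users.setdefault(gram, set()).add(user)
        added = {g for g in gram_posts                   # S7: threshold filters
                 if gram_posts[g] > min_add
                 and len(gram_users[g]) >= min_users
                 and score(g) > min_score} - kept
        if not added and kept == neologisms:             # nothing added or deleted
            break
        neologisms = kept | added
    return neologisms


posts = [("u1", "今天给力"), ("u2", "真给力"), ("u3", "给力啊"), ("u1", "今天好")]
print(extract_neologisms(posts))  # {'给力'}
```

In this toy run "给力" appears in three posts by three authors and is accepted, while "今天" appears in only two posts by a single author and is rejected, which shows why counting in both the text and the user dimension matters.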
2. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 1, characterized in that, in S4, the neologisms are automatically extracted according to the user information of the microblog data in addition to the microblog text.
3. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 1, characterized in that, in S8, the segmented word segments in the word segmentation result are merged iteratively by an iterative algorithm, wherein only the 2-tuples and 3-tuples in the microblog data need to be searched in each iteration.
4. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 3, characterized in that S8 further comprises:
after each iteration completes, adding the neologisms found to the neologism list, and using the neologism list as the user-defined dictionary of the preset Chinese analysis tool, so that in the next word segmentation operation the neologisms found in the previous iteration are segmented correctly.
5. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 1, characterized in that S6 further comprises:
calculating, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency, specifically:
[formula missing in this text]
wherein [symbol missing] and [symbol missing] respectively denote the microblog-text-based frequencies of the word wn and of [symbol missing], T is the total number of microblogs, and n is the parameter n of the n-tuple, n = 2 or 3;
calculating the EMI score between users according to the distribution of the word among users, specifically:
[formula missing in this text]
wherein [symbol missing] and [symbol missing] respectively denote the user-based usage frequencies of the word wn and of [symbol missing], Tu is the total number of users, and n is the parameter n of the n-tuple, n = 2 or 3;
obtaining the relatedness score of the candidate neologism from the EMI score based on text frequency and the EMI score between users, specifically:
ascore(wn) = EMI(wn) + usrEMI(wn),
wherein ascore(wn) is the relatedness score of the candidate neologism wn.
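The two EMI formulas are not reproduced in this text, so the sketch below assumes a commonly used form of enhanced mutual information, log2((F/T) / Π((F_i − F)/T)), where F is the frequency of the whole n-tuple and F_i is the frequency of its i-th segment. That form, the smoothing constant `eps`, and the function names are assumptions for illustration; only the final sum ascore = EMI + usrEMI is taken from the claim.

```python
import math


def emi(freq_whole, freq_parts, total, eps=0.5):
    """Assumed EMI form: log2((F/T) / prod((F_i - F)/T)).

    freq_whole: frequency F of the whole n-tuple
    freq_parts: frequencies F_i of its n segments (n = 2 or 3)
    total:      T (microblogs) or Tu (users)
    eps:        hypothetical floor that keeps the denominator positive when a
                segment never occurs outside the n-tuple
    """
    numerator = freq_whole / total
    denominator = 1.0
    for freq_part in freq_parts:
        denominator *= max(freq_part - freq_whole, eps) / total
    return math.log2(numerator / denominator)


def ascore(text_whole, text_parts, n_posts, user_whole, user_parts, n_users):
    """Relatedness score: text-based EMI plus user-based EMI."""
    return (emi(text_whole, text_parts, n_posts)
            + emi(user_whole, user_parts, n_users))


# A 2-tuple seen in 8 of 100 posts and used by 4 of 50 users (made-up counts).
print(ascore(8, [10, 12], 100, 4, [5, 6], 50))
```

Under this assumed form, a tuple whose segments rarely occur apart from each other receives a high score, and adding the user-based term rewards tuples that many different authors use rather than ones repeated by a single account.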
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610324541.9A CN105956158B (en) | 2016-05-17 | 2016-05-17 | Method for automatically extracting network neologisms based on massive microblog text and user information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105956158A (en) | 2016-09-21 |
CN105956158B (en) | 2019-08-09 |
Family
ID=56912577
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201610324541.9A CN105956158B (en) | Active | 2016-05-17 | 2016-05-17 | Method for automatically extracting network neologisms based on massive microblog text and user information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105956158B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005135311A (en) * | 2003-10-31 | 2005-05-26 | Nippon Telegr & Teleph Corp <Ntt> | Category-classified new feature word ranking method, apparatus and program, and computer readable storage medium recorded with category-classified new feature word ranking program |
CN1924858A (en) * | 2006-08-09 | 2007-03-07 | 北京搜狗科技发展有限公司 | Method and device for fetching new words and input method system |
CN103678656A (en) * | 2013-12-23 | 2014-03-26 | 合肥工业大学 | Unsupervised automatic extraction method of microblog new words based on repeated word strings |
US20140288924A1 (en) * | 2008-06-06 | 2014-09-25 | Zi Corporation Of Canada, Inc. | Systems and methods for an automated personalized dictionary generator for portable devices |
Non-Patent Citations (2)
Title |
---|
PENG F, FENG F, MCCALLUM A.: "Chinese segmentation and new word detection using conditional random fields", Proceedings of the 20th International Conference on Computational Linguistics, Association for Computational Linguistics * |
李钝 (Li Dun): "Internet中的新词识别" (New word recognition on the Internet), 《北京邮电大学学报》 (Journal of Beijing University of Posts and Telecommunications) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106528523A (en) * | 2016-09-22 | 2017-03-22 | 中山大学 | Network neologism identification method |
CN106528523B (en) * | 2016-09-22 | 2019-05-10 | 中山大学 | Network new word identification method |
CN107992501A (en) * | 2016-10-27 | 2018-05-04 | 腾讯科技(深圳)有限公司 | Social network information recognition methods, processing method and processing device |
CN107992501B (en) * | 2016-10-27 | 2021-12-14 | 腾讯科技(深圳)有限公司 | Social network information identification method, processing method and device |
WO2019085335A1 (en) * | 2017-11-01 | 2019-05-09 | 平安科技(深圳)有限公司 | Method for discovering investment objects with new words, device and storage medium |
CN108509425A (en) * | 2018-04-10 | 2018-09-07 | 中国人民解放军陆军工程大学 | Chinese new word discovery method based on novelty |
CN108509425B (en) * | 2018-04-10 | 2021-08-24 | 中国人民解放军陆军工程大学 | Chinese new word discovery method based on novelty |
WO2021027085A1 (en) * | 2019-08-15 | 2021-02-18 | 苏州朗动网络科技有限公司 | Method and device for automatically extracting text keyword, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN105956158B (en) | 2019-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107766324B (en) | Text consistency analysis method based on deep neural network | |
CN108287922B (en) | Text data viewpoint abstract mining method fusing topic attributes and emotional information | |
Li et al. | Recursive deep models for discourse parsing | |
CN104915340B (en) | Natural language question-answering method and device | |
CN108874878A (en) | Building system and method for knowledge mapping | |
CN108038205B (en) | Viewpoint analysis prototype system for Chinese microblogs | |
CN105528437B (en) | Question answering system construction method based on structured text knowledge extraction | |
CN104008166B (en) | Dialogue short text clustering method based on form and semantic similarity | |
CN107463553A (en) | Text semantic extraction, representation and modeling method and system for elementary mathematics problems | |
CN102253930B (en) | Method and device for text translation | |
CN111190900B (en) | JSON data visualization optimization method in cloud computing mode | |
CN105512245A (en) | Enterprise figure building method based on regression model | |
CN106844346A (en) | Short text semantic similarity discrimination method and system based on deep learning model Word2Vec | |
CN105956158A (en) | Automatic extraction method of network neologism on the basis of mass microblog texts and use information | |
CN109902302B (en) | Topic map generation method, device and equipment suitable for text analysis or data mining and computer storage medium | |
CN104778256B (en) | Fast incremental clustering method for domain question answering system consultation | |
CN108363725A (en) | Method for extracting user comment viewpoints and generating viewpoint labels | |
CN103793501A (en) | Theme community discovery method based on social network | |
CN109522396B (en) | Knowledge processing method and system for national defense science and technology field | |
CN104699797A (en) | Webpage data structured analytic method and device | |
CN104346382B (en) | Text analysis system and method using language queries | |
Nakashole et al. | Real-time population of knowledge bases: opportunities and challenges | |
Sun et al. | Graph force learning | |
CN104572633A (en) | Method for determining meanings of polysemous word | |
Gupta et al. | Keyword extraction: a review |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |