CN110334345A - New word discovery method - Google Patents
New word discovery method
- Publication number
- CN110334345A (application CN201910519979.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- adjacent
- repeat pattern
- filtering
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
A new word discovery method comprising the following steps: clean and save the corpus; segment the corpus and tag parts of speech; filter by word frequency and part of speech; construct the repeat-pattern set; filter and delete repeat patterns; the remaining repeat patterns are new words. The filtering stage of the invention combines word frequency, inner coupling degree, left (right) adjacency entropy, left-neighbor right-adjacency entropy, right-neighbor left-adjacency entropy, left-neighbor average right-adjacency entropy, and right-neighbor average left-adjacency entropy as judgment criteria, substantially improving the precision of the discovered new words.
Description
Technical field
The present invention relates to the field of intelligent interaction, and in particular to a new word discovery method and apparatus based on social media.
Background art
In many areas of Chinese information processing, functionality depends on a dictionary. For example, in an intelligent retrieval system or intelligent dialogue system, segmentation, question retrieval, and similarity matching together determine the search results or the dialogue answer. Each of these steps computes over words as the minimal unit, and the computation rests on the word dictionary, so the word dictionary strongly influences the performance of the whole system.
With the rapid growth of the internet, social media platforms such as Weibo and WeChat have changed how people communicate. Netizens voice their opinions across many platforms, favor internet neologisms, and write posts and comments in a largely colloquial register; many new words are therefore coined by users and spread across the network at great speed. Whether the word dictionary is updated promptly after a new word appears has a decisive effect on the effectiveness of the intelligent dialogue system built on that dictionary.
Current new word discovery methods fall into two classes: classification-based and annotation-based. Classification-based methods first extract candidate strings from the corpus and then judge, by rules or statistics, whether each candidate is a new word. Annotation-based methods combine new word discovery with Chinese word segmentation and find new words on top of the segmentation. Existing methods, such as patents 201510706254.X, 201810409087.6 and 201810409083.8, share two drawbacks: limiting word length in the segmentation unit causes some new words to be missed (lowering recall), and incomplete feature parameters in the statistics unit lower the precision of the discovered words.
To improve both the recall and the precision of new word discovery, the present invention proposes a method that mixes the two approaches above: on top of Chinese word segmentation, it discovers new words using both rules and statistical information.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of new word discovery.
To solve this problem, the present invention provides a new word discovery method comprising the following steps:
S1: clean and save the corpus;
S2: segment the corpus and tag parts of speech;
S3: filter by word frequency and part of speech;
S4: construct the repeat-pattern set;
S5: filter and delete repeat patterns;
S6: the remaining repeat patterns are new words.
Further, cleaning and saving the corpus comprises: cleaning the experimental corpus according to corpus cleaning rules, and saving the corpus item by item, one line per item.
Further, segmenting the corpus and tagging parts of speech comprises: segmenting the microblog corpus and tagging parts of speech with the NLPIR tool and a user dictionary, obtaining the POS-tagged corpus.
Further, word frequency filtering and part of speech filtering comprise the following steps:
S31: count the frequency of each word in the POS-tagged corpus; according to a preset word frequency threshold, put low-frequency words into the filter vocabulary and add high-frequency words to the initial candidate list;
S32: build a filter part-of-speech set; for each word in the POS-tagged corpus, judge whether its part of speech is in the filter set; if it is, add the word to the filter vocabulary, otherwise add it to the initial candidate list.
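Steps S31 and S32 amount to a single pass over the POS-tagged tokens. A minimal Python sketch (the function name, threshold value, and filter POS set here are illustrative assumptions, not values fixed by the patent):

```python
from collections import Counter

def frequency_pos_filter(tagged_words, freq_threshold, filter_pos):
    """Split POS-tagged (word, pos) tokens into a filter vocabulary
    and an initial candidate list, per steps S31-S32 (sketch)."""
    counts = Counter(w for w, _ in tagged_words)
    filter_vocab, candidates = set(), []
    for word, pos in tagged_words:
        if counts[word] < freq_threshold or pos in filter_pos:
            filter_vocab.add(word)       # low-frequency word or filtered POS
        elif word not in candidates:
            candidates.append(word)      # high-frequency candidate word
    return filter_vocab, candidates
```

Words falling below the frequency threshold or carrying a filtered part of speech never seed a repeat pattern; everything else becomes an initial candidate.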
Further, constructing the repeat-pattern set comprises: loop through the initial candidate list L0 and take an initial candidate word; if the word on its right side is not in the filter vocabulary, superimpose it to obtain repeated string 1 and add it to the repeat-pattern list R; continue superimposing the next right-side word on repeated string 1; if it is not in the filter vocabulary, obtain repeated string 2 and add it to R. The superposition stops when the right-side word is a punctuation mark or a word in the filter vocabulary, yielding the repeat-pattern list.
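The rightward-superposition loop of step S4 can be sketched as follows (a simplified illustration; the function name and the punctuation set are assumptions, not fixed by the patent):

```python
def build_repeat_patterns(tokens, candidates, filter_vocab,
                          punctuation="，。！？,.!?"):
    """Starting at each candidate word, keep appending the next token
    until a punctuation mark or a filter-vocabulary word is met;
    every intermediate concatenation is a repeat pattern (sketch)."""
    patterns = []
    for i, tok in enumerate(tokens):
        if tok not in candidates:
            continue
        pattern = tok
        for nxt in tokens[i + 1:]:
            if nxt in filter_vocab or nxt in punctuation:
                break                 # superposition stops here
            pattern += nxt
            patterns.append(pattern)  # repeated string 1, 2, ...
    return patterns
```

Each candidate thus contributes a chain of progressively longer repeated strings, exactly as in the "repeated string 1, repeated string 2, …" description above.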
Further, filtering and deleting repeat patterns comprises the following steps:
S51: filter the obtained repeat-pattern list against the integrated basic dictionary; a repeat pattern present in the basic dictionary is filtered out and deleted;
S52: compute the word frequency of each repeat pattern and, according to a preset word frequency threshold, delete repeat patterns whose frequency is below the threshold.
Further, filtering and deleting repeat patterns further comprises the following steps:
S53: compute the inner coupling degree of each repeat pattern and, according to a preset threshold, delete repeat patterns below the threshold;
S54: count the left adjacent character set and the right adjacent character set of each repeat pattern and, according to preset left (right) adjacency entropy thresholds, filter out and delete repeat patterns below the threshold;
S55: count the right adjacent character set of each left adjacent character and the left adjacent character set of each right adjacent character and, according to preset left (right) neighbor average right (left) adjacency entropy thresholds, filter out and delete repeat patterns below the average adjacency entropy threshold;
S56: filter the repeat patterns obtained in the previous step against the Chinese collocation library; a repeat pattern present in the collocation library is filtered out and deleted.
The new word discovery method of the invention uses a system comprising a corpus preprocessing unit, a segmentation unit, a screening and filtering unit, a repeat-pattern construction unit, and a statistics computation unit. The corpus preprocessing unit cleans and saves the corpus; the segmentation unit segments the corpus and tags parts of speech; the screening and filtering unit performs word frequency and part of speech filtering on candidate words; the repeat-pattern construction unit builds the repeat-pattern set from candidate words; the statistics computation unit computes parameters such as the inner coupling degree of each repeat pattern and filters and deletes accordingly.
Compared with the prior art, the present invention discovers new words by combining rule-based and statistical methods and has the following beneficial effects:
First, the segmentation unit segments the experimental corpus with a Chinese word segmentation tool and integrates a user dictionary, guaranteeing segmentation accuracy to the greatest extent and thereby the precision of the discovered new words.
Second, the screening and filtering unit not only builds a stop part-of-speech set but also integrates multiple dictionaries as the background basic dictionary.
Third, the filtering stage combines word frequency, inner coupling degree, left (right) adjacency entropy, left-neighbor right-adjacency entropy, right-neighbor left-adjacency entropy, left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy as judgment criteria, substantially improving the precision of the discovered new words.
Detailed description of the invention
Fig. 1 is a flow chart of the new word discovery method of the invention.
Specific embodiment
It is understandable to enable above-mentioned purpose of the invention, feature and beneficial effect to become apparent, 1 pair of sheet with reference to the accompanying drawing
The specific embodiment of invention is described in detail.
First, the experimental corpus is cleaned according to the corpus cleaning rules, and the corpus is saved item by item, one line per item.
Second, each line is read in order; the microblog corpus is segmented and POS-tagged with the NLPIR tool and a user dictionary, obtaining the POS-tagged corpus.
Then, the frequency of each word in the POS-tagged corpus is counted; according to a preset word frequency threshold, low-frequency words are put into the filter vocabulary and high-frequency words are added to the initial candidate list.
Next, the filter part-of-speech set is built; for each word in the POS-tagged corpus, if its part of speech is in the filter set it is added to the filter vocabulary; otherwise it is added to the initial candidate list, and the method proceeds to the next step.
Example: from the original microblog corpus, take the post: "The term eat soil (吃土) originates from the Double-11 shopping carnival: netizens joke that, having overspent their budget while shopping, they will have to eat soil next month, describing a kind of mania for online shopping." After the first segmentation, the segmentation and POS-tagging result is: eat/v soil/n one/m word/n originates-from/v Double-11/nz shopping/vn carnival/n ，/wd netizen/n in/p shopping/vi 的/ude1 process/n 中/n because/p cost/n over-budget/n self-mock/vi next-month/nz eat/v soil/n ，/wd come/vf describe/v 过/vf towards/p network/n shopping/vi 的/ude1 one/m kind/v madness/a degree/n 。/wj. By word frequency, the low-frequency words in the segmented corpus — "process", "cost", "next month", "describe", "madness" and "degree" — are added to the filter vocabulary. By part of speech, "one", the comma "，", the particles and prepositions (的, 在, 中), "because", "come", "过" and "towards" are added to the filter vocabulary.
S5: construct the repeat-pattern set. If there is a character string to the right of the current candidate word and it is not a punctuation mark, further judge whether the current string is in the filter vocabulary or matches the filter part-of-speech set; if not, combine the current candidate word with the current string to obtain a repeat pattern, and so obtain the repeat-pattern list.
Specifically, loop through the initial candidate list L0 and take an initial candidate word; if the word on its right side is not in the filter vocabulary, superimpose it to obtain repeated string 1 and add it to the repeat-pattern list R; continue superimposing the next right-side word on repeated string 1; if it is not in the filter vocabulary, obtain repeated string 2 and add it to R. The superposition stops when the right-side word is a punctuation mark or a word in the filter vocabulary.
Example: for the segmented corpus, the repeat patterns are constructed as follows. Starting from "eat", "eat soil" is built first; since "one" is in the filter vocabulary, iteration stops. Starting from "word", the patterns "word originates-from", "word originates-from Double-11", "word originates-from Double-11 shopping", "word originates-from Double-11 shopping carnival", "originates-from Double-11", "originates-from Double-11 shopping", "originates-from Double-11 shopping carnival", "Double-11 shopping", "Double-11 shopping carnival" and "shopping carnival" are built; since "，" is in the filter vocabulary, iteration stops. Because the function words as well as "process", "because" and "cost" are in the filter vocabulary, construction resumes at "over-budget", building "over-budget self-mocking", and iteration stops because "next month" is in the filter vocabulary. Starting from "eat", "eat soil" is built again; since "，", "come", "describe" and "towards" are in the filter vocabulary, iteration stops. Starting from "network", "online shopping" is built; since "one", "madness" and "degree" are in the filter vocabulary, iteration stops. The repeat-pattern construction process then ends.
S6: filter the repeat-pattern list obtained in the previous step against the integrated basic dictionary. A repeat pattern present in the basic dictionary is filtered out and deleted; otherwise, proceed to S7.
Example: filtering the repeat patterns against the basic dictionary yields the candidate new words: "eat soil", "word originates-from", "word originates-from Double-11", "word originates-from Double-11 shopping", "word originates-from Double-11 shopping carnival", "originates-from Double-11", "originates-from Double-11 shopping", "originates-from Double-11 shopping carnival", "Double-11 shopping", "Double-11 shopping carnival", "shopping carnival", "over-budget self-mocking" and "online shopping".
S7: compute the word frequency of each repeat pattern. According to a preset word frequency threshold, repeat patterns whose frequency is below the threshold are filtered out and deleted.
Example: by computing word frequencies, the following candidates are filtered out: "word originates-from", "word originates-from Double-11", "word originates-from Double-11 shopping", "word originates-from Double-11 shopping carnival", "originates-from Double-11", "originates-from Double-11 shopping", "originates-from Double-11 shopping carnival", "over-budget self-mocking" and "online shopping".
S8: compute the inner coupling degree of each repeat pattern. Enumerate all substrings of the repeat pattern, compute the inner coupling over the substrings, and obtain the value of the repeat pattern's inner coupling degree by formula (1). According to a preset threshold, repeat patterns below the threshold are filtered out and deleted.
The internal tightness of a word can be measured by its inner coupling degree (Inside Coupling), defined as follows: enumerate all possible binary splits of the word string w, {(w11, w12), (w21, w22) … (wi1, wi2) … (wn1, wn2)} (for example, the binary splits of "Chinese person" 中国人 are (中国 "China", 人 "person") and (中, 国人)); the resulting IC(w) is called the inner coupling degree of w. P(w) denotes the probability that the word string w occurs in the text domain D (the original corpus), computed by formula (2) as P(w) = N(w)/N_D, where N(w) is the number of times w occurs in D and N_D is the total number of words in D. The larger the IC value, the stronger the correlation between the parts of the string and the higher the cohesion of the word; conversely, a smaller IC value indicates lower correlation and lower cohesion.
Example: by computing the inner coupling degree, the following candidates are filtered out: "Double-11 shopping", "Double-11 shopping carnival" and "shopping carnival".
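For illustration, a sketch of an inner-coupling computation. Formula (1) itself is not reproduced in this text, so the sketch uses one common formulation — the probability of the whole string divided by the largest probability product over its binary splits; the exact formula in the patent may differ:

```python
def inner_coupling(word, counts, total):
    """Hypothetical inner-coupling sketch for a string of length >= 2:
    P(w) over the best binary-split product P(w1)*P(w2), with
    P(s) = counts[s] / total as in formula (2)."""
    p = lambda s: counts.get(s, 0) / total
    best = max(p(word[:i]) * p(word[i:]) for i in range(1, len(word)))
    return p(word) / best if best else 0.0
```

A tightly coupled string (its parts rarely occur apart) yields a large value; a loose concatenation of frequent parts yields a small one, matching the IC discussion above.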
S9: count the left adjacent character set and the right adjacent character set of each repeat pattern. Compute the left (right) adjacency entropy of each repeat pattern by formula (3). According to preset left (right) adjacency entropy thresholds, repeat patterns below the threshold are filtered out and deleted.
Let C = {c1, c2, … ci … cn} be the set of all single characters that may appear to the left (right) of the word string w in the text domain D, called the left (right) adjacent word set of w. Formula (3), IE(w) = −Σ_{i=1}^{n} (n_i/n) · lb(n_i/n), computed over C, gives the entropy of the left (right) adjacent word set of w, where n_i is the number of times c_i occurs as a left (right) adjacent word of w, n is the total number of occurrences of all words in C as left (right) adjacent words of w, and lb denotes the base-2 logarithm.
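The adjacency entropy IE(w) = −Σ (n_i/n) · lb(n_i/n) can be computed directly from the list of observed adjacent words (a minimal sketch; the function name is illustrative):

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    """Entropy of a left (or right) adjacent-word set:
    IE(w) = -sum(n_i/n * lb(n_i/n)), lb = base-2 logarithm."""
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A string whose neighbors are diverse on both sides (high entropy) is more likely a free-standing word; a string always preceded or followed by the same character scores near zero.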
S10: count the right adjacent character set of each left adjacent character of the repeat pattern, and the left adjacent character set of each right adjacent character. Compute each word's left-neighbor right-adjacency entropy and right-neighbor left-adjacency entropy by formulas (4) and (5), and each word's left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy by formulas (6) and (7). According to preset left (right) neighbor average right (left) adjacency entropy thresholds, repeat patterns below the average adjacency entropy threshold are filtered out and deleted.
Left-neighbor right-adjacency entropy (formula (4)): xi denotes a left adjacent word of the candidate word, i the number of left adjacent words of the candidate, Gj the right adjacent words of that left adjacent word, and j the number of right adjacent words of the current xi; P denotes the probability, within the candidate's left adjacent word set, that xi occurs as a left adjacent word of the candidate.
Right-neighbor left-adjacency entropy (formula (5)): xi denotes a right adjacent word of the candidate word, i the number of right adjacent words of the candidate, Gj the left adjacent words of that right adjacent word, and j the number of left adjacent words of the current xi; P denotes the probability, within the candidate's right adjacent word set, that xi occurs as a right adjacent word of the candidate.
Left-neighbor average right-adjacency entropy (formula (6)): the average of LRE(xi), the left-neighbor right-adjacency entropy, over the candidate's m left adjacent words.
Right-neighbor average left-adjacency entropy (formula (7)): the average of RLE(xi), the right-neighbor left-adjacency entropy, over the candidate's m right adjacent words.
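Formulas (6) and (7) average the per-neighbor adjacency entropies. A sketch, assuming formulas (4)/(5) reduce to the adjacency entropy of each neighbor's own adjacent-word set (the weighting by P is omitted here, which is an assumption about the unreproduced formulas):

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    # IE = -sum(n_i/n * log2(n_i/n)) over adjacent-word counts
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def average_neighbor_entropy(neighbor_sets):
    # Formulas (6)/(7): average the adjacency entropy of each of the
    # candidate's m neighbors' own adjacent-word sets.
    m = len(neighbor_sets)
    return sum(adjacency_entropy(s) for s in neighbor_sets) / m
```

Intuitively, if the candidate's neighbors are themselves flexible words (each with varied neighbors of its own), the candidate boundary is likely a genuine word boundary.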
S11: filter the remaining repeat patterns against the Chinese collocation library; a repeat pattern present in the collocation library is filtered out and deleted.
S12: the remaining words are taken as new words.
Example: compute the left adjacency entropy, right adjacency entropy, left-neighbor right-adjacency entropy, right-neighbor left-adjacency entropy, left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy for the candidate words not yet filtered in this corpus. The new word finally obtained is "eat soil" (吃土).
Although the disclosure is as above, the present invention is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention; the protection scope of the invention shall therefore be defined by the claims.
Claims (11)
1. A new word discovery method, characterized by comprising the following steps:
S1: clean and save the corpus;
S2: segment the corpus and tag parts of speech;
S3: filter by word frequency and part of speech;
S4: construct the repeat-pattern set;
S5: filter and delete repeat patterns;
S6: the remaining repeat patterns are new words.
2. The new word discovery method according to claim 1, characterized in that cleaning and saving the corpus comprises: cleaning the experimental corpus according to corpus cleaning rules, and saving the corpus item by item, one line per item.
3. The new word discovery method according to claim 2, characterized in that segmenting the corpus and tagging parts of speech comprises: segmenting the microblog corpus and tagging parts of speech with the NLPIR tool and a user dictionary, obtaining the POS-tagged corpus.
4. The new word discovery method according to claim 3, characterized in that word frequency filtering and part of speech filtering comprise the following steps:
S31: count the frequency of each word in the POS-tagged corpus; according to a preset word frequency threshold, put low-frequency words into the filter vocabulary and add high-frequency words to the initial candidate list;
S32: build a filter part-of-speech set; for each word in the POS-tagged corpus, judge whether its part of speech is in the filter set; if it is, add the word to the filter vocabulary, otherwise add it to the initial candidate list.
5. The new word discovery method according to claim 4, characterized in that constructing the repeat-pattern set comprises: loop through the initial candidate list L0 and take an initial candidate word; if the word on its right side is not in the filter vocabulary, superimpose it to obtain repeated string 1 and add it to the repeat-pattern list R; continue superimposing the next right-side word on repeated string 1; if it is not in the filter vocabulary, obtain repeated string 2 and add it to R; the superposition stops when the right-side word is a punctuation mark or a word in the filter vocabulary, yielding the repeat-pattern list.
6. The new word discovery method according to claim 5, characterized in that filtering and deleting repeat patterns comprises the following steps:
S51: filter the obtained repeat-pattern list against the integrated basic dictionary; a repeat pattern present in the basic dictionary is filtered out and deleted;
S52: compute the word frequency of each repeat pattern and, according to a preset word frequency threshold, delete repeat patterns whose frequency is below the threshold.
7. The new word discovery method according to claim 6, characterized in that filtering and deleting repeat patterns further comprises the following steps:
S53: compute the inner coupling degree of each repeat pattern and, according to a preset threshold, delete repeat patterns below the threshold;
S54: count the left adjacent character set and the right adjacent character set of each repeat pattern and, according to preset left (right) adjacency entropy thresholds, filter out and delete repeat patterns below the threshold;
S55: count the right adjacent character set of each left adjacent character and the left adjacent character set of each right adjacent character and, according to preset left (right) neighbor average right (left) adjacency entropy thresholds, filter out and delete repeat patterns below the average adjacency entropy threshold;
S56: filter the repeat patterns obtained in the previous step against the Chinese collocation library; a repeat pattern present in the collocation library is filtered out and deleted.
8. The new word discovery method according to claim 7, characterized in that computing the inner coupling degree of a repeat pattern comprises: enumerate all substrings of the repeat pattern and compute the inner coupling over the substrings, obtaining the value of the repeat pattern's inner coupling degree by formula (1);
the internal tightness of a word can be measured by its inner coupling degree, defined as follows: enumerate all possible binary splits of the word string w, {(w11, w12), (w21, w22) … (wi1, wi2) … (wn1, wn2)}; the resulting IC(w) is called the inner coupling degree of w;
P(w) denotes the probability that the word string w occurs in the text domain D, computed by formula (2) as P(w) = N(w)/N_D, where N(w) is the number of times w occurs in D and N_D is the total number of words in D; the larger the IC value, the stronger the correlation between the parts of the string and the higher the cohesion of the word; conversely, a smaller IC value indicates lower correlation and lower cohesion.
9. The new word discovery method according to claim 8, characterized in that counting the left adjacent character set and the right adjacent character set of a repeat pattern comprises: compute the left (right) adjacency entropy of each repeat pattern by formula (3);
let C = {c1, c2, … ci … cn} be the set of all single characters that may appear to the left (right) of the word string w in the text domain D, called the left (right) adjacent word set of w; formula (3), IE(w) = −Σ_{i=1}^{n} (n_i/n) · lb(n_i/n), computed over C, gives the entropy of the left (right) adjacent word set of w, where n_i is the number of times c_i occurs as a left (right) adjacent word of w, n is the total number of occurrences of all words in C as left (right) adjacent words of w, and lb denotes the base-2 logarithm.
10. The new word discovery method according to claim 9, characterized in that counting the right adjacent character set of each left adjacent character and the left adjacent character set of each right adjacent character comprises: compute each word's left-neighbor right-adjacency entropy and right-neighbor left-adjacency entropy by formulas (4) and (5), and each word's left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy by formulas (6) and (7):
left-neighbor right-adjacency entropy (formula (4)): xi denotes a left adjacent word of the candidate word, i the number of left adjacent words of the candidate, Gj the right adjacent words of that left adjacent word, and j the number of right adjacent words of the current xi; P denotes the probability, within the candidate's left adjacent word set, that xi occurs as a left adjacent word of the candidate;
right-neighbor left-adjacency entropy (formula (5)): xi denotes a right adjacent word of the candidate word, i the number of right adjacent words of the candidate, Gj the left adjacent words of that right adjacent word, and j the number of left adjacent words of the current xi; P denotes the probability, within the candidate's right adjacent word set, that xi occurs as a right adjacent word of the candidate;
left-neighbor average right-adjacency entropy (formula (6)): the average of LRE(xi), the left-neighbor right-adjacency entropy, over the candidate's m left adjacent words;
right-neighbor average left-adjacency entropy (formula (7)): the average of RLE(xi), the right-neighbor left-adjacency entropy, over the candidate's m right adjacent words.
11. A new word discovery system, comprising a corpus preprocessing unit, a word segmentation unit, a screening-and-filtering unit, a repeat-pattern construction unit, and a statistics computation unit; wherein the corpus preprocessing unit cleans and stores the corpus; the word segmentation unit segments the corpus and tags parts of speech; the screening-and-filtering unit applies word-frequency filtering and part-of-speech filtering to candidate words; the repeat-pattern construction unit builds a repeat-pattern set from the candidate words; and the statistics computation unit computes parameters such as the internal coupling degree of each repeat pattern and filters out patterns accordingly.
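As a rough illustration of how the units in claim 11 could chain together, here is a minimal Python sketch; all function names are hypothetical, and a simple character n-gram pass stands in for the patent's POS-tagging segmenter and repeat-pattern construction:

```python
import re
from collections import Counter

def preprocess(corpus):
    """Corpus preprocessing unit (stand-in): strip whitespace noise."""
    return re.sub(r"\s+", "", corpus)

def segment(text, n=2):
    """Segmentation unit (stand-in): emit overlapping character n-grams
    as candidate words; the patent instead segments and tags parts of speech."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def frequency_filter(candidates, min_freq=2):
    """Screening-and-filtering unit: keep candidates at or above a
    frequency threshold (the patent also filters by part of speech)."""
    counts = Counter(candidates)
    return {w: c for w, c in counts.items() if c >= min_freq}

def discover(corpus, min_freq=2):
    """End-to-end sketch: preprocess -> segment -> frequency-filter.
    Repeat-pattern construction and entropy/coupling filtering would follow."""
    return frequency_filter(segment(preprocess(corpus)), min_freq)
```

The surviving candidates would then feed the repeat-pattern and statistics units described above.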
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910519979.6A CN110334345A (en) | 2019-06-17 | 2019-06-17 | New word discovery method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110334345A true CN110334345A (en) | 2019-10-15 |
Family
ID=68141071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910519979.6A Pending CN110334345A (en) | 2019-06-17 | 2019-06-17 | New word discovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110334345A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106528527A (en) * | 2016-10-14 | 2017-03-22 | 深圳中兴网信科技有限公司 | Identification method and identification system for out of vocabularies |
CN108845982A (en) * | 2017-12-08 | 2018-11-20 | 昆明理工大学 | A kind of Chinese word cutting method of word-based linked character |
Non-Patent Citations (1)
Title |
---|
ZHAO Xiaobao et al.: "New word recognition based on an iterative algorithm", Computer Engineering (《计算机工程》) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329443A (en) * | 2020-11-03 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Method, device, computer equipment and medium for determining new words |
CN112329443B (en) * | 2020-11-03 | 2023-07-21 | 中国平安人寿保险股份有限公司 | Method, device, computer equipment and medium for determining new words |
CN113609844A (en) * | 2021-07-30 | 2021-11-05 | 国网山西省电力公司晋城供电公司 | Electric power professional word bank construction method based on hybrid model and clustering algorithm |
CN113609844B (en) * | 2021-07-30 | 2024-03-08 | 国网山西省电力公司晋城供电公司 | Electric power professional word stock construction method based on hybrid model and clustering algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110781317B (en) | Method and device for constructing event map and electronic equipment | |
CN108132927B (en) | Keyword extraction method for combining graph structure and node association | |
CN104239539B (en) | A kind of micro-blog information filter method merged based on much information | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN104077417B (en) | People tag in social networks recommends method and system | |
CN110457404B (en) | Social media account classification method based on complex heterogeneous network | |
CN105045875B (en) | Personalized search and device | |
CN103812872B (en) | A kind of network navy behavioral value method and system based on mixing Di Li Cray process | |
CN110110094A (en) | Across a network personage's correlating method based on social networks knowledge mapping | |
TWI443529B (en) | Methods and systems for automatically constructing domain phrases, and computer program products thereof | |
CN107835113A (en) | Abnormal user detection method in a kind of social networks based on network mapping | |
CN108694647A (en) | A kind of method for digging and device of trade company's rationale for the recommendation, electronic equipment | |
CN110992059B (en) | Surrounding string behavior recognition analysis method based on big data | |
CN101980199A (en) | Method and system for discovering network hot topic based on situation assessment | |
CN102945246B (en) | The disposal route of network information data and device | |
CN104484343A (en) | Topic detection and tracking method for microblog | |
CN105095419A (en) | Method for maximizing influence of information to specific type of weibo users | |
CN109685153A (en) | A kind of social networks rumour discrimination method based on characteristic aggregation | |
CN108932669A (en) | A kind of abnormal account detection method based on supervised analytic hierarchy process (AHP) | |
CN109582714B (en) | Government affair item data processing method based on time attenuation association | |
CN112084373B (en) | Graph embedding-based multi-source heterogeneous network user alignment method | |
CN103631862B (en) | Event characteristic evolution excavation method and system based on microblogs | |
CN110334345A (en) | New word discovery method | |
CN104239321B (en) | A kind of data processing method and device of Search Engine-Oriented | |
KR101224312B1 (en) | Friend recommendation method for SNS user, recording medium for the same, and SNS and server using the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191015 |