CN110334345A - New word discovery method - Google Patents

New word discovery method

Info

Publication number
CN110334345A
CN110334345A
Authority
CN
China
Prior art keywords
word
adjacent
repeat pattern
filtering
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910519979.6A
Other languages
Chinese (zh)
Inventor
李慧
王慧慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University filed Critical Capital Normal University
Priority to CN201910519979.6A priority Critical patent/CN110334345A/en
Publication of CN110334345A publication Critical patent/CN110334345A/en
Pending legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

A new word discovery method comprising the following steps: cleaning and saving a corpus; segmenting the corpus and tagging parts of speech; filtering by word frequency and part of speech; constructing a set of repeat patterns; filtering and deleting repeat patterns; the remaining repeat patterns are new words. The filtering and screening of the invention incorporates judgment criteria such as word frequency, internal coupling degree, left (right) adjacency entropy, the right-adjacency entropy of left neighbors, the left-adjacency entropy of right neighbors, the average right-adjacency entropy of left neighbors and the average left-adjacency entropy of right neighbors, which substantially improves the accuracy of the discovered new words.

Description

New word discovery method
Technical field
The present invention relates to the field of intelligent interaction, and more particularly to a new word discovery method and apparatus based on social media.
Background technique
In the various fields of Chinese information processing, corresponding functions must be completed on the basis of a dictionary. For example, in an intelligent retrieval system or an intelligent dialogue system, word segmentation, question retrieval and similarity matching determine the search results or the answers of the intelligent dialogue. Each of these processes takes the word as its minimum unit of computation, and the basis of the computation is the word dictionary, so the word dictionary has a great influence on the performance of the intelligent system.
With the rapid development of the Internet, social media platforms such as Weibo and WeChat have changed people's conventional modes of communication. Netizens publish their opinions on multiple network platforms and tend to use network neologisms, and most content and comments are colloquial expressions. Many new words are therefore created by netizens and spread across the network at great speed. Whether the word dictionary can be updated in time after a new word appears has a decisive influence on the effectiveness of the intelligent dialogue system built on that dictionary.
Current new word discovery methods can be divided into two classes: classification-based methods and annotation-based methods. Classification-based methods first extract candidate strings from a corpus and then judge, according to rules or statistical information, whether a candidate string is a new word. Annotation-based methods combine new word discovery with Chinese word segmentation, finding new words on the basis of segmentation. However, existing new word discovery methods, such as patents 201510706254.X, 201810409087.6 and 201810409083.8, have the following disadvantages: limiting the word length in the segmentation unit causes part of the new words to be missed, lowering recall; incomplete characteristic parameters in the computation unit reduce the accuracy of the discovered new words.
To improve the recall and accuracy of new word discovery, the present invention proposes a new word discovery method that mixes the two approaches above: on the basis of Chinese word segmentation, it discovers new words according to rules and statistical information.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of new word discovery.
To solve the above technical problem, the present invention provides a new word discovery method comprising the following steps:
S1: clean and save the corpus;
S2: segment the corpus and tag parts of speech;
S3: filter by word frequency and part of speech;
S4: construct the set of repeat patterns;
S5: filter and delete repeat patterns;
S6: the remaining repeat patterns are new words.
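The six steps above can be sketched as a pipeline skeleton. Every function name below is a hypothetical placeholder for the units described later in this document, not an identifier from the patent:

```python
def discover_new_words(raw_corpus, clean, segment, filter_words,
                       build_patterns, filter_patterns):
    """Skeleton of steps S1-S6; the five callables stand in for the
    concrete units (preprocessing, segmentation, screening, pattern
    construction, statistical filtering) of the method."""
    corpus = clean(raw_corpus)                    # S1: clean and save
    tagged = segment(corpus)                      # S2: segment and POS-tag
    vocab, candidates = filter_words(tagged)      # S3: frequency/POS filtering
    patterns = build_patterns(candidates, vocab)  # S4: repeat patterns
    return filter_patterns(patterns)              # S5/S6: survivors are new words

# Trivial identity stand-ins just to show the data flow.
result = discover_new_words(
    "raw text",
    clean=lambda c: c,
    segment=lambda c: c,
    filter_words=lambda t: (set(), t),
    build_patterns=lambda cands, vocab: cands,
    filter_patterns=lambda p: p,
)
print(result)
```

The skeleton only fixes the order of the steps; each callable is replaced by the corresponding unit in a real implementation.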
Further, cleaning and saving the corpus comprises: cleaning the experimental corpus according to corpus cleaning rules, and saving the corpus item by item, one item per line.
Further, segmenting the corpus and tagging parts of speech comprises: segmenting the Weibo corpus and tagging parts of speech using the NLPIR tool together with a user dictionary, obtaining a part-of-speech-tagged corpus.
Further, word frequency filtering and part-of-speech filtering comprise the following steps:
S31: count the frequency of each word in the part-of-speech-tagged corpus; according to a preset word frequency threshold, put low-frequency words into a filter vocabulary and add high-frequency words to an initial candidate list;
S32: construct a set of filtered parts of speech; for each word in the part-of-speech-tagged corpus, judge whether its part of speech is in the filtered set; if so, add the word to the filter vocabulary, otherwise add it to the initial candidate list.
Further, constructing the set of repeat patterns comprises: looping through the initial candidate list L0; for each initial candidate word, the word to its right is appended; if the right-hand word is not in the filter vocabulary, the concatenation is performed, obtaining repeated string 1, which is added to the repeat pattern list R; the word to the right of repeated string 1 is then appended in the same way; if it is not in the filter vocabulary, repeated string 2 is obtained and added to R. The concatenation process stops when the right-hand word is a punctuation mark or a word in the filter vocabulary, yielding the repeat pattern list.
Further, filtering and deleting repeat patterns comprises the following steps:
S51: filter the obtained repeat pattern list against an integrated basic dictionary; if a repeat pattern exists in the basic dictionary, filter it out;
S52: calculate the word frequency of each repeat pattern; according to a preset word frequency threshold, filter by frequency and delete repeat patterns below the threshold.
Further, filtering and deleting repeat patterns further comprises the following steps:
S53: calculate the internal coupling degree of each repeat pattern; according to a preset threshold, filter the repeat patterns and delete those below the threshold;
S54: count the left-adjacent character set and right-adjacent character set of each repeat pattern; according to preset left (right) adjacency entropy thresholds, filter out repeat patterns whose left (right) adjacency entropy is below the threshold;
S55: count the right-adjacent character set of each left-adjacent character and the left-adjacent character set of each right-adjacent character of the repeat pattern; according to preset average left-neighbor right-adjacency (right-neighbor left-adjacency) entropy thresholds, filter out repeat patterns below the average adjacency entropy threshold;
S56: filter the repeat patterns obtained in the previous step against a Chinese collocation library; if a repeat pattern exists in the collocation library, filter it out.
The new word discovery method of the invention uses a system comprising a corpus preprocessing unit, a segmentation unit, a screening and filtering unit, a repeat pattern construction unit and a statistical information computation unit. The corpus preprocessing unit cleans and saves the corpus; the segmentation unit segments the corpus and tags parts of speech; the screening and filtering unit performs word frequency filtering and part-of-speech filtering on candidate words; the repeat pattern construction unit constructs the set of repeat patterns from candidate words; the statistical information computation unit calculates parameters such as the internal coupling degree of each repeat pattern and filters out patterns accordingly.
Compared with the prior art, the present invention discovers new words using a method that combines rules with statistics, and has the following beneficial effects:
First, in the segmentation unit the present invention segments the experimental corpus using a Chinese word segmentation tool and integrates a user dictionary, which guarantees the accuracy of segmentation to the greatest extent and thereby the accuracy of the discovered new words.
Second, in the screening and filtering unit the present invention not only constructs a set of stop parts of speech but also integrates multiple dictionaries as a background basic dictionary.
Third, the filtering and screening of the present invention incorporates judgment criteria such as word frequency, internal coupling degree, left (right) adjacency entropy, the right-adjacency entropy of left neighbors, the left-adjacency entropy of right neighbors, the average right-adjacency entropy of left neighbors and the average left-adjacency entropy of right neighbors, which substantially improves the accuracy of the discovered new words.
Detailed description of the invention
Fig. 1 is a flow chart of the new word discovery method of the invention.
Specific embodiment
To make the above objects, features and beneficial effects of the invention more apparent and understandable, specific embodiments of the invention are described in detail below with reference to Fig. 1.
First, the experimental corpus is cleaned according to corpus cleaning rules, and the corpus is saved item by item, one item per line.
Second, each line is read in sequence; the Weibo corpus is segmented and part-of-speech tagged using the NLPIR tool and a user dictionary, obtaining a part-of-speech-tagged corpus.
Then the frequency of each word in the part-of-speech-tagged corpus is counted; according to a preset word frequency threshold, low-frequency words are put into the filter vocabulary and high-frequency words are added to the initial candidate list.
Next, a set of filtered parts of speech is constructed. For each word in the part-of-speech-tagged corpus, it is judged whether its part of speech is in the filtered set. If so, the word is added to the filter vocabulary; if not, it is added to the initial candidate list, and the next step follows.
Example: in the original Weibo corpus, a post is chosen: "The word 'eat soil' originates from the Double 11 shopping carnival; netizens whose spending exceeded their budget during shopping joked that they would eat soil next month, describing a kind of madness for online shopping." After the first segmentation, the segmentation and part-of-speech tagging result is (glossed in English): eat/v soil/n one/m word/n originate-from/v Double-11/nz shopping/vn carnival/n ,/wd netizen/n at/p shopping/vi of/ude1 process/n in/f because/p spending/n exceed-budget/n self-mock/vi next-month/nz eat/v soil/n ,/wd to/vf describe/v toward/p network/n shopping/vi of/ude1 one/m kind/q madness/a degree/n ./wj. According to word frequency, the low-frequency words in the segmented corpus ("process", "spending", "next month", "describe", "madness", "degree") are added to the filter vocabulary. According to part of speech, "one", ",", "at", "of", "in", "because", "to", "toward" and "." in the segmented corpus are added to the filter vocabulary.
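The frequency and part-of-speech filtering of steps S31/S32 can be sketched as follows. The threshold, the filtered part-of-speech set and the toy tagged corpus (English glosses of the Weibo example) are illustrative assumptions, not values prescribed by the patent:

```python
from collections import Counter

def filter_words(tagged_corpus, freq_threshold, filtered_pos):
    """Split (word, pos) tokens into a filter vocabulary and an initial
    candidate list, as in steps S31/S32."""
    freq = Counter(word for word, _ in tagged_corpus)
    filter_vocab, candidates = set(), []
    for word, pos in tagged_corpus:
        # S31: low-frequency words go to the filter vocabulary.
        # S32: words with a filtered part of speech go there too.
        if freq[word] < freq_threshold or pos in filtered_pos:
            filter_vocab.add(word)
        elif word not in candidates:
            candidates.append(word)
    return filter_vocab, candidates

corpus = [("eat", "v"), ("soil", "n"), ("one", "m"), ("word", "n"),
          ("eat", "v"), ("soil", "n"), (",", "wd")]
vocab, cands = filter_words(corpus, freq_threshold=2, filtered_pos={"m", "wd"})
print(vocab)   # low-frequency or filtered-POS words
print(cands)   # high-frequency candidate words
```

On this toy corpus, "eat" and "soil" survive as candidates while the singleton words and the filtered parts of speech land in the filter vocabulary.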
S5: construct the set of repeat patterns. If there is a string to the right of the current candidate word and it is not a punctuation mark, further judge whether the current string is in the filter vocabulary or the filtered part-of-speech set; if not, combine the current candidate word with the current string to obtain a repeat pattern, and thus obtain the repeat pattern list.
Specifically, the initial candidate list L0 is looped through. For each initial candidate word, the word to its right is appended; if the right-hand word is not in the filter vocabulary, the concatenation is performed, obtaining repeated string 1, which is added to the repeat pattern list R; the word to the right of repeated string 1 is then appended in the same way; if it is not in the filter vocabulary, repeated string 2 is obtained and added to R. This concatenation process stops when the right-hand word is a punctuation mark or a word in the filter vocabulary.
Example: for the segmented corpus, the repeat pattern construction process is as follows. Starting from "eat", "eat soil" is constructed first; since "one" is in the filter vocabulary, iteration stops. Then, starting from "word", the following are constructed: "word originates-from", "word originates-from Double 11", "word originates-from Double 11 shopping", "word originates-from Double 11 shopping carnival", "originates-from Double 11", "originates-from Double 11 shopping", "originates-from Double 11 shopping carnival", "Double 11 shopping", "Double 11 shopping carnival", "shopping carnival"; since "," is in the filter vocabulary, iteration stops. Since "at", "of", "in", "process", "because" and "spending" are in the filter vocabulary, construction resumes from "exceed-budget": "exceed-budget self-mock" is constructed, and since "next month" is in the filter vocabulary, iteration stops. Starting from "eat", "eat soil" is constructed; since ",", "to", "describe" and "toward" are in the filter vocabulary, iteration stops. Starting from "network", "network shopping" is constructed; since "of", "one", "madness" and "degree" are in the filter vocabulary, iteration stops, and the repeat pattern construction process ends.
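The rightward-concatenation construction above can be sketched as follows. Tokens are joined with spaces here because the example uses English glosses; the patent concatenates Chinese characters directly, and the punctuation set is an illustrative assumption:

```python
def build_repeat_patterns(tokens, filter_vocab,
                          punctuation=frozenset({",", ".", "!", "?"})):
    """Construct repeat patterns (steps S4/S5): starting at each candidate
    token, keep appending the next token to the right until a punctuation
    mark or a filter-vocabulary word is met."""
    patterns = []
    for i, start in enumerate(tokens):
        if start in filter_vocab or start in punctuation:
            continue  # not a candidate starting point
        current = start
        for right in tokens[i + 1:]:
            if right in punctuation or right in filter_vocab:
                break  # stop the concatenation, as in the patent
            current = current + " " + right
            patterns.append(current)
    return patterns

print(build_repeat_patterns(["eat", "soil", "one", "word"], {"one"}))
```

On the four-token example, only "eat soil" is produced: extension past "soil" is blocked by the filtered word "one".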
S6: the repeat pattern list obtained in the previous step is filtered against the integrated basic dictionary. If a repeat pattern exists in the basic dictionary, it is filtered out; if not, proceed to S7.
Example: after filtering the repeat patterns against the basic dictionary, the candidate new words are: "eat soil", "word originates-from", "word originates-from Double 11", "word originates-from Double 11 shopping", "word originates-from Double 11 shopping carnival", "originates-from Double 11", "originates-from Double 11 shopping", "originates-from Double 11 shopping carnival", "Double 11 shopping", "Double 11 shopping carnival", "shopping carnival", "exceed-budget self-mock", "network shopping".
S7: calculate the word frequency of each repeat pattern. According to a preset word frequency threshold, filter by frequency and delete repeat patterns below the threshold.
Example: the following candidates are filtered out by word frequency: "word originates-from", "word originates-from Double 11", "word originates-from Double 11 shopping", "word originates-from Double 11 shopping carnival", "originates-from Double 11", "originates-from Double 11 shopping", "originates-from Double 11 shopping carnival", "exceed-budget self-mock", "network shopping".
S8: calculate the internal coupling degree of each repeat pattern. Enumerate all binary splits of the repeat pattern into two substrings, compute the coupling over the splits, and obtain the value of the internal coupling degree of the repeat pattern by formula (1). According to a preset threshold, filter the repeat patterns and delete those below the threshold.
The internal tightness of a word can be measured by its internal coupling degree (inside coupling), defined as follows: the word string w is divided into the set of all possible binary splits {(w11, w12), (w21, w22), …, (wi1, wi2), …, (wn1, wn2)} (for example, all possible splits of "中国人" ("Chinese person"): ("中国", "人"), ("中", "国人")), and the resulting IC(w) is called the internal coupling degree of the word string w. Here P(w) denotes the probability of occurrence of the word string w in the text domain D (the original corpus), computed by formula (2):

P(w) = N(w) / N_D (2)

where N(w) is the number of times the word string w occurs in the text domain D and N_D is the total number of words in the text domain. The larger the IC value, the higher the correlation between the parts of the word string and the higher the cohesion of the word; conversely, the smaller the IC value, the lower the correlation between the parts and the lower the cohesion of the word.
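Formula (1) for the internal coupling degree appears only as an image in the published document. The sketch below therefore substitutes a common cohesion measure, the probability of the whole string divided by the largest product of probabilities over its binary splits, purely to illustrate how P(w) = N(w)/N_D from formula (2) enters the computation; the measure and the toy counts are assumptions, not the patent's exact formula:

```python
def prob(string, counts, total):
    """P(w) = N(w) / N_D, formula (2)."""
    return counts.get(string, 0) / total

def inner_coupling(word, counts, total):
    """Assumed stand-in for formula (1): score the string against its
    weakest (most probable) binary split."""
    p_w = prob(word, counts, total)
    if p_w == 0:
        return 0.0
    best_split = max(
        prob(word[:i], counts, total) * prob(word[i:], counts, total)
        for i in range(1, len(word))
    )
    return p_w / best_split if best_split > 0 else float("inf")

# Toy counts: the string "ab" and its parts in a 100-word text domain.
print(inner_coupling("ab", {"ab": 4, "a": 5, "b": 5}, 100))  # ≈ 16
```

A higher score means the whole string occurs far more often than chance co-occurrence of its parts would predict, matching the prose description of high cohesion.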
Example: the following candidates are filtered out by calculating the internal coupling degree: "Double 11 shopping", "Double 11 shopping carnival", "shopping carnival".
S9: count the left-adjacent character set and right-adjacent character set of each repeat pattern. Find the left (right) adjacency entropy of each repeat pattern by formula (3). According to the preset left (right) adjacency entropy threshold, filter out repeat patterns whose left (right) adjacency entropy is below the threshold.
The set C = {c1, c2, …, ci, …, cn} of all single characters that may appear to the left (right) of the word string w in the text domain D is called the left (right) adjacent word set of w. Applying formula (3) to C:

IE(w) = -Σ_i (n_i / n) · lb(n_i / n) (3)

the resulting IE(w) is called the entropy of the left (right) adjacent word set of w, where n_i is the number of times ci appears as a left (right) adjacent word of w, n is the total number of occurrences of all words in C as left (right) adjacent words of w, and lb denotes the base-2 logarithm.
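Formula (3) is fully determined by this definition, IE(w) = -Σ_i (n_i/n) · lb(n_i/n), and can be sketched as:

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    """Formula (3): entropy of the left (right) adjacent word set of w.
    `neighbors` is the list of words observed adjacent to w, one entry
    per occurrence."""
    counts = Counter(neighbors)
    n = sum(counts.values())
    # -sum over distinct neighbors of (n_i/n) * lb(n_i/n)
    return -sum((ni / n) * math.log2(ni / n) for ni in counts.values())

print(adjacency_entropy(["a", "b", "a", "b"]))  # two equally likely neighbors
print(adjacency_entropy(["a", "a", "a"]))       # a single fixed neighbor
```

Two equally likely neighbors give an entropy of 1 bit, while a single fixed neighbor gives 0, which is why low adjacency entropy marks a string that is probably an incomplete fragment of a longer word.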
S10: count the right-adjacent character set of each left-adjacent character of the repeat pattern and the left-adjacent character set of each right-adjacent character. Calculate the right-adjacency entropy of each left neighbor and the left-adjacency entropy of each right neighbor by formulas (4) and (5), and the average right-adjacency entropy of the left neighbors and the average left-adjacency entropy of the right neighbors of each word by formulas (6) and (7). According to the preset average adjacency entropy thresholds, filter out repeat patterns below the average adjacency entropy threshold.
Right-adjacency entropy of a left neighbor (formula (4)):

LRE(x_i) = -Σ_j p(G_j | x_i) · lb p(G_j | x_i) (4)

where x_i denotes a left-adjacent word of the candidate word, i indexes the left-adjacent words of the candidate word, G_j denotes a right-adjacent word of the left-adjacent word x_i, j indexes the right-adjacent words of the current x_i, and p(G_j | x_i) denotes the probability that, when x_i occurs as the left-adjacent word of the candidate word, G_j is the right-adjacent word of x_i.

Left-adjacency entropy of a right neighbor (formula (5)):

RLE(x_i) = -Σ_j p(G_j | x_i) · lb p(G_j | x_i) (5)

where x_i denotes a right-adjacent word of the candidate word, i indexes the right-adjacent words of the candidate word, G_j denotes a left-adjacent word of the right-adjacent word x_i, j indexes the left-adjacent words of the current x_i, and p(G_j | x_i) denotes the corresponding conditional probability.
Average right-adjacency entropy of the left neighbors (formula (6)):

(1/m) · Σ_{i=1}^{m} LRE(x_i) (6)

where LRE(x_i) denotes the right-adjacency entropy of the left neighbor x_i of the candidate word, and m is the number of left-adjacent words of the candidate word.

Average left-adjacency entropy of the right neighbors (formula (7)):

(1/m) · Σ_{i=1}^{m} RLE(x_i) (7)

where RLE(x_i) denotes the left-adjacency entropy of the right neighbor x_i of the candidate word, and m is the number of right-adjacent words of the candidate word.
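Formulas (4) through (7) also appear only as images in the published document. Under one natural reading of the surrounding definitions, LRE(x_i) (or RLE(x_i)) is the adjacency entropy of the words on the far side of the neighbor x_i, and formulas (6)/(7) average it over the m neighbors; that reading is an assumption, and the sketch below follows it:

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    """Formula (3): entropy of an adjacent word set."""
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def average_neighbor_entropy(neighbor_contexts):
    """Assumed reading of formulas (6)/(7): the mean, over the m left
    (right) neighbors x_i of a candidate word, of the adjacency entropy of
    each x_i's own far-side neighbors.  `neighbor_contexts` maps x_i to
    the list of words observed on its far side (formulas (4)/(5))."""
    m = len(neighbor_contexts)
    if m == 0:
        return 0.0
    return sum(adjacency_entropy(ctx) for ctx in neighbor_contexts.values()) / m

# One left neighbor with varied right context, one with a fixed context.
print(average_neighbor_entropy({"x1": ["a", "b", "a", "b"], "x2": ["c"]}))
```

A candidate whose neighbors themselves occur in varied contexts scores higher, which is the extra evidence the patent uses beyond the plain adjacency entropy of S9.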
S11: filter the repeat patterns obtained in the previous step against the Chinese collocation library; if a repeat pattern exists in the collocation library, filter it out.
S12: the remaining words are taken as new words.
Example: the left adjacency entropy, right adjacency entropy, right-adjacency entropy of left neighbors, left-adjacency entropy of right neighbors, average right-adjacency entropy of left neighbors and average left-adjacency entropy of right neighbors are calculated for the candidate words not yet filtered out of this corpus. The new word finally obtained is "eat soil".
The new word discovery method of the invention uses a system comprising a corpus preprocessing unit, a segmentation unit, a screening and filtering unit, a repeat pattern construction unit and a statistical information computation unit. The corpus preprocessing unit cleans and saves the corpus; the segmentation unit segments the corpus and tags parts of speech; the screening and filtering unit performs word frequency filtering and part-of-speech filtering on candidate words; the repeat pattern construction unit constructs the set of repeat patterns from candidate words; the statistical information computation unit calculates parameters such as the internal coupling degree of each repeat pattern and filters out patterns accordingly.
Although the disclosure is as above, the present invention is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; the protection scope of the invention is therefore defined by the claims.

Claims (11)

1. A new word discovery method, characterized by comprising the following steps:
S1: clean and save the corpus;
S2: segment the corpus and tag parts of speech;
S3: filter by word frequency and part of speech;
S4: construct the set of repeat patterns;
S5: filter and delete repeat patterns;
S6: the remaining repeat patterns are new words.
2. The new word discovery method according to claim 1, characterized in that cleaning and saving the corpus comprises: cleaning the experimental corpus according to corpus cleaning rules, and saving the corpus item by item, one item per line.
3. The new word discovery method according to claim 2, characterized in that segmenting the corpus and tagging parts of speech comprises: segmenting the Weibo corpus and tagging parts of speech using the NLPIR tool and a user dictionary, obtaining a part-of-speech-tagged corpus.
4. The new word discovery method according to claim 3, characterized in that word frequency filtering and part-of-speech filtering comprise the following steps:
S31: count the frequency of each word in the part-of-speech-tagged corpus; according to a preset word frequency threshold, put low-frequency words into a filter vocabulary and add high-frequency words to an initial candidate list;
S32: construct a set of filtered parts of speech; for each word in the part-of-speech-tagged corpus, judge whether its part of speech is in the filtered set; if so, add the word to the filter vocabulary, otherwise add it to the initial candidate list.
5. The new word discovery method according to claim 4, characterized in that constructing the set of repeat patterns comprises: looping through the initial candidate list L0; for each initial candidate word, the word to its right is appended; if the right-hand word is not in the filter vocabulary, the concatenation is performed, obtaining repeated string 1, which is added to the repeat pattern list R; the word to the right of repeated string 1 is then appended in the same way; if it is not in the filter vocabulary, repeated string 2 is obtained and added to R; the concatenation process stops when the right-hand word is a punctuation mark or a word in the filter vocabulary, yielding the repeat pattern list.
6. The new word discovery method according to claim 5, characterized in that filtering and deleting repeat patterns comprises the following steps:
S51: filter the obtained repeat pattern list against an integrated basic dictionary; if a repeat pattern exists in the basic dictionary, filter it out;
S52: calculate the word frequency of each repeat pattern; according to a preset word frequency threshold, filter by frequency and delete repeat patterns below the threshold.
7. The new word discovery method according to claim 6, characterized in that filtering and deleting repeat patterns further comprises the following steps:
S53: calculate the internal coupling degree of each repeat pattern; according to a preset threshold, filter the repeat patterns and delete those below the threshold;
S54: count the left-adjacent character set and right-adjacent character set of each repeat pattern; according to preset left (right) adjacency entropy thresholds, filter out repeat patterns whose left (right) adjacency entropy is below the threshold;
S55: count the right-adjacent character set of each left-adjacent character of the repeat pattern and the left-adjacent character set of each right-adjacent character; according to preset average left-neighbor right-adjacency (right-neighbor left-adjacency) entropy thresholds, filter out repeat patterns below the average adjacency entropy threshold;
S56: filter the repeat patterns obtained in the previous step against a Chinese collocation library; if a repeat pattern exists in the collocation library, filter it out.
8. The new word discovery method according to claim 7, characterized in that calculating the internal coupling degree of a repeat pattern comprises: enumerating all binary splits of the repeat pattern into two substrings, computing the coupling over the splits, and obtaining the value of the internal coupling degree of the repeat pattern by formula (1), wherein the internal tightness of a word is measured by its internal coupling degree, defined as follows: the word string w is divided into the set of all possible binary splits {(w11, w12), (w21, w22), …, (wi1, wi2), …, (wn1, wn2)}, and the resulting IC(w) is called the internal coupling degree of the word string w;
P(w) denotes the probability of occurrence of the word string w in the text domain D, computed by formula (2):

P(w) = N(w) / N_D (2)

where N(w) is the number of times the word string w occurs in the text domain D and N_D is the total number of words in the text domain; the larger the IC value, the higher the correlation between the parts of the word string and the higher the cohesion of the word; conversely, the smaller the IC value, the lower the correlation between the parts and the lower the cohesion of the word.
9. The new word discovery method according to claim 8, characterized in that counting the left-adjacent character set and right-adjacent character set of a repeat pattern comprises: finding the left (right) adjacency entropy of each repeat pattern by formula (3), wherein the set C = {c1, c2, …, ci, …, cn} of all single characters that may appear to the left (right) of the word string w in the text domain D is called the left (right) adjacent word set of w; applying formula (3) to C:

IE(w) = -Σ_i (n_i / n) · lb(n_i / n) (3)

the resulting IE(w) is called the entropy of the left (right) adjacent word set of w, where n_i is the number of times ci appears as a left (right) adjacent word of w, n is the total number of occurrences of all words in C as left (right) adjacent words of w, and lb denotes the base-2 logarithm.
10. The new word discovery method according to claim 9, wherein counting the right adjacent-word set of each left-adjacent word of a repeat pattern and the left adjacent-word set of each right-adjacent word comprises: computing the left-neighbor right entropy and the right-neighbor left entropy of each word by formulas (4) and (5), and computing the mean left-neighbor right entropy and the mean right-neighbor left entropy of each word by formulas (6) and (7):

Left-neighbor right entropy:

LRE(xi) = -P(xi) · Σ_{j=1..J} p(Gj|xi) · lb p(Gj|xi)    (4)

where xi denotes a left adjacent word of the candidate word, i indexes the left adjacent words of the candidate word, Gj denotes a right adjacent word of the left adjacent word xi, J is the number of right adjacent words of the current xi, and P(xi) is the probability that xi occurs as a left adjacent word of the candidate word within the left adjacent-word set.

Right-neighbor left entropy:

RLE(xi) = -P(xi) · Σ_{j=1..J} p(Gj|xi) · lb p(Gj|xi)    (5)

where xi denotes a right adjacent word of the candidate word, i indexes the right adjacent words of the candidate word, Gj denotes a left adjacent word of the right adjacent word xi, J is the number of left adjacent words of the current xi, and P(xi) is the probability that xi occurs as a right adjacent word of the candidate word within the right adjacent-word set.

Mean left-neighbor right entropy:

(1/m) · Σ_{i=1..m} LRE(xi)    (6)

where LRE(xi) is the left-neighbor right entropy of the candidate word's left neighbor xi and m is the number of left neighbors of the candidate word.

Mean right-neighbor left entropy:

(1/m) · Σ_{i=1..m} RLE(xi)    (7)

where RLE(xi) is the right-neighbor left entropy of the candidate word's right neighbor xi and m is the number of right neighbors of the candidate word.
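A minimal sketch of the left-neighbor right entropy and its mean; the P(xi)-weighted form of formula (4) and the toy neighbor counts are assumptions of this sketch, not values taken from the application (formulas (5) and (7) are the mirror-image computation over right neighbors):

```python
from math import log2

def neighbor_entropy(counts):
    """Entropy in bits (lb = log base 2) of a neighbor-count dictionary."""
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

def lre(p_xi, right_counts):
    """Formula (4) as assumed here: entropy of the right neighbors G_j of a
    left-neighbor word x_i, weighted by P(x_i)."""
    return p_xi * neighbor_entropy(right_counts)

def mean_lre(left_neighbors):
    """Formula (6): average LRE over the m left neighbors of a candidate word.
    `left_neighbors` pairs each x_i's probability P(x_i) with a dictionary of
    counts of that x_i's right neighbors."""
    return sum(lre(p, rc) for p, rc in left_neighbors) / len(left_neighbors)

# Two left neighbors: one with a varied right context, one always followed
# by the same word (invented counts).
print(mean_lre([(0.5, {"a": 1, "b": 1}), (0.5, {"c": 4})]))  # -> 0.25
```

Intuitively, a left neighbor whose own right context is nearly deterministic contributes little entropy, hinting that it binds tightly to the candidate and that the candidate's boundary may be drawn too narrowly.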
11. A new word discovery system, comprising a corpus preprocessing unit, a word segmentation unit, a screening and filtering unit, a repeat-pattern construction unit, and a statistical-information computation unit; wherein the corpus preprocessing unit cleans and saves the corpus; the word segmentation unit segments the corpus and tags parts of speech; the screening and filtering unit applies word-frequency filtering and part-of-speech filtering to candidate words; the repeat-pattern construction unit builds a repeat-pattern set from the candidate words; and the statistical-information computation unit calculates parameters such as the internal coupling degree of each repeat pattern and filters out patterns accordingly.
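Two of the claimed units can be sketched briefly. Segmentation and POS tagging are replaced here by a character-level stand-in, and `max_len`/`min_freq` are illustrative parameters, not values from the application:

```python
import re
from collections import Counter

def preprocess(raw_docs):
    """Corpus preprocessing unit: drop empty documents, strip whitespace noise."""
    return [re.sub(r"\s+", "", d) for d in raw_docs if d.strip()]

def build_repeat_patterns(tokens, max_len=4, min_freq=2):
    """Repeat-pattern construction unit: collect n-grams of 2..max_len tokens
    occurring at least min_freq times as candidate new words. (This sketch
    concatenates tokens, so n-grams may straddle document boundaries; such
    spurious patterns are rare and usually fall below min_freq.)"""
    grams = Counter(
        tuple(tokens[i:i + n])
        for n in range(2, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {g: c for g, c in grams.items() if c >= min_freq}

docs = preprocess(["新词 发现", "   ", "新词 识别"])
tokens = [ch for doc in docs for ch in doc]
print(build_repeat_patterns(tokens))
```

The surviving patterns would then be scored by the statistical-information unit (internal coupling degree, adjacent entropy) before the final filtering step.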
CN201910519979.6A 2019-06-17 2019-06-17 New word discovery method Pending CN110334345A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910519979.6A CN110334345A (en) 2019-06-17 2019-06-17 New word discovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910519979.6A CN110334345A (en) 2019-06-17 2019-06-17 New word discovery method

Publications (1)

Publication Number Publication Date
CN110334345A true CN110334345A (en) 2019-10-15

Family

ID=68141071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910519979.6A Pending CN110334345A (en) 2019-06-17 2019-06-17 New word discovery method

Country Status (1)

Country Link
CN (1) CN110334345A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528527A (en) * 2016-10-14 2017-03-22 深圳中兴网信科技有限公司 Identification method and identification system for out of vocabularies
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO, XIAOBAO et al.: "New Word Recognition Based on an Iterative Algorithm", Computer Engineering *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329443A (en) * 2020-11-03 2021-02-05 中国平安人寿保险股份有限公司 Method, device, computer equipment and medium for determining new words
CN112329443B (en) * 2020-11-03 2023-07-21 中国平安人寿保险股份有限公司 Method, device, computer equipment and medium for determining new words
CN113609844A (en) * 2021-07-30 2021-11-05 国网山西省电力公司晋城供电公司 Electric power professional word bank construction method based on hybrid model and clustering algorithm
CN113609844B (en) * 2021-07-30 2024-03-08 国网山西省电力公司晋城供电公司 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Similar Documents

Publication Publication Date Title
CN110781317B Method and device for constructing an event graph, and electronic device
CN108132927B Keyword extraction method combining graph structure and node association
CN104239539B Microblog information filtering method based on multi-information fusion
CN104765769B Short-text query expansion and retrieval method based on word vectors
CN104077417B Method and system for recommending people tags in social networks
CN110457404B Social media account classification method based on complex heterogeneous networks
CN105045875B Personalized search method and device
CN103812872B Method and system for detecting online water-army behavior based on a mixed Dirichlet process
CN110110094A Cross-network person matching method based on social network knowledge graphs
TWI443529B Methods and systems for automatically constructing domain phrases, and computer program products thereof
CN107835113A Method for detecting abnormal users in social networks based on network mapping
CN108694647A Method and device for mining merchant recommendation reasons, and electronic device
CN110992059B Bid-rigging behavior recognition and analysis method based on big data
CN101980199A Method and system for discovering hot network topics based on situation assessment
CN102945246B Method and device for processing network information data
CN104484343A Topic detection and tracking method for microblogs
CN105095419A Method for maximizing the influence of information on a specific type of microblog users
CN109685153A Social network rumor discrimination method based on feature aggregation
CN108932669A Abnormal account detection method based on a supervised analytic hierarchy process (AHP)
CN109582714B Government affairs item data processing method based on time-decay association
CN112084373B Graph-embedding-based user alignment method for multi-source heterogeneous networks
CN103631862B Method and system for mining event feature evolution based on microblogs
CN110334345A New word discovery method
CN104239321B Data processing method and device for search engines
KR101224312B1 Friend recommendation method for SNS user, recording medium for the same, and SNS and server using the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191015
