CN110334345A - New word discovery method - Google Patents
New word discovery method
- Publication number
- CN110334345A (application CN201910519979.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- adjacent
- repeat pattern
- filtering
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
A new word discovery method comprising the following steps: clean and save the corpus; segment the corpus and tag parts of speech; filter by word frequency and part of speech; construct the repeat-pattern set; filter and delete repeat patterns; the remaining repeat patterns are new words. The filtering stage of the invention combines word frequency, inner coupling degree, left (right) adjacency entropy, left-neighbor right-adjacency entropy, right-neighbor left-adjacency entropy, left-neighbor average right-adjacency entropy, and right-neighbor average left-adjacency entropy as judgment criteria, substantially improving the precision of the discovered new words.
Description
Technical field
The present invention relates to the field of intelligent interaction, and in particular to a new word discovery method and apparatus based on social media.
Background art
In many areas of Chinese information processing, functionality depends on a dictionary. For example, in an intelligent retrieval system or intelligent dialogue system, segmentation, question retrieval, and similarity matching together determine the search results or the dialogue answer. Each of these steps computes over words as the minimal unit, and the computation rests on the word dictionary, so the word dictionary strongly influences the performance of the whole system.
With the rapid growth of the internet, social media platforms such as Weibo and WeChat have changed how people communicate. Netizens voice their opinions across many platforms, favor internet neologisms, and write posts and comments in a largely colloquial register; many new words are therefore coined by users and spread across the network at great speed. Whether the word dictionary is updated promptly after a new word appears has a decisive effect on the effectiveness of the intelligent dialogue system built on that dictionary.
Current new word discovery methods fall into two classes: classification-based and annotation-based. Classification-based methods first extract candidate strings from the corpus and then judge, by rules or statistics, whether each candidate is a new word. Annotation-based methods combine new word discovery with Chinese word segmentation and find new words on top of the segmentation. Existing methods, such as patents 201510706254.X, 201810409087.6 and 201810409083.8, share two drawbacks: limiting word length in the segmentation unit causes some new words to be missed (lowering recall), and incomplete feature parameters in the statistics unit lower the precision of the discovered words.
To improve both the recall and the precision of new word discovery, the present invention proposes a method that mixes the two approaches above: on top of Chinese word segmentation, it discovers new words using both rules and statistical information.
Summary of the invention
The technical problem solved by the present invention is how to improve the accuracy of new word discovery.
To solve this problem, the present invention provides a new word discovery method comprising the following steps:
S1: clean and save the corpus;
S2: segment the corpus and tag parts of speech;
S3: filter by word frequency and part of speech;
S4: construct the repeat-pattern set;
S5: filter and delete repeat patterns;
S6: the remaining repeat patterns are new words.
Further, cleaning and saving the corpus comprises: cleaning the experimental corpus according to corpus cleaning rules, and saving the corpus item by item, one line per item.
Further, segmenting the corpus and tagging parts of speech comprises: segmenting the microblog corpus and tagging parts of speech with the NLPIR tool and a user dictionary, obtaining the POS-tagged corpus.
Further, word frequency filtering and part of speech filtering comprise the following steps:
S31: count the frequency of each word in the POS-tagged corpus; according to a preset word frequency threshold, put low-frequency words into the filter vocabulary and add high-frequency words to the initial candidate list;
S32: build a filter part-of-speech set; for each word in the POS-tagged corpus, judge whether its part of speech is in the filter set; if it is, add the word to the filter vocabulary, otherwise add it to the initial candidate list.
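Steps S31 and S32 amount to a single pass over the POS-tagged tokens. A minimal Python sketch (the function name, threshold value, and filter POS set here are illustrative assumptions, not values fixed by the patent):

```python
from collections import Counter

def frequency_pos_filter(tagged_words, freq_threshold, filter_pos):
    """Split POS-tagged (word, pos) tokens into a filter vocabulary
    and an initial candidate list, per steps S31-S32 (sketch)."""
    counts = Counter(w for w, _ in tagged_words)
    filter_vocab, candidates = set(), []
    for word, pos in tagged_words:
        if counts[word] < freq_threshold or pos in filter_pos:
            filter_vocab.add(word)       # low-frequency word or filtered POS
        elif word not in candidates:
            candidates.append(word)      # high-frequency candidate word
    return filter_vocab, candidates
```

Words falling below the frequency threshold or carrying a filtered part of speech never seed a repeat pattern; everything else becomes an initial candidate.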
Further, constructing the repeat-pattern set comprises: loop through the initial candidate list L0 and take an initial candidate word; if the word on its right side is not in the filter vocabulary, superimpose it to obtain repeated string 1 and add it to the repeat-pattern list R; continue superimposing the next right-side word on repeated string 1; if it is not in the filter vocabulary, obtain repeated string 2 and add it to R. The superposition stops when the right-side word is a punctuation mark or a word in the filter vocabulary, yielding the repeat-pattern list.
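The rightward-superposition loop of step S4 can be sketched as follows (a simplified illustration; the function name and the punctuation set are assumptions, not fixed by the patent):

```python
def build_repeat_patterns(tokens, candidates, filter_vocab,
                          punctuation="，。！？,.!?"):
    """Starting at each candidate word, keep appending the next token
    until a punctuation mark or a filter-vocabulary word is met;
    every intermediate concatenation is a repeat pattern (sketch)."""
    patterns = []
    for i, tok in enumerate(tokens):
        if tok not in candidates:
            continue
        pattern = tok
        for nxt in tokens[i + 1:]:
            if nxt in filter_vocab or nxt in punctuation:
                break                 # superposition stops here
            pattern += nxt
            patterns.append(pattern)  # repeated string 1, 2, ...
    return patterns
```

Each candidate thus contributes a chain of progressively longer repeated strings, exactly as in the "repeated string 1, repeated string 2, …" description above.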
Further, filtering and deleting repeat patterns comprises the following steps:
S51: filter the obtained repeat-pattern list against the integrated basic dictionary; a repeat pattern present in the basic dictionary is filtered out and deleted;
S52: compute the word frequency of each repeat pattern and, according to a preset word frequency threshold, delete repeat patterns whose frequency is below the threshold.
Further, filtering and deleting repeat patterns further comprises the following steps:
S53: compute the inner coupling degree of each repeat pattern and, according to a preset threshold, delete repeat patterns below the threshold;
S54: count the left adjacent character set and the right adjacent character set of each repeat pattern and, according to preset left (right) adjacency entropy thresholds, filter out and delete repeat patterns below the threshold;
S55: count the right adjacent character set of each left adjacent character and the left adjacent character set of each right adjacent character and, according to preset left (right) neighbor average right (left) adjacency entropy thresholds, filter out and delete repeat patterns below the average adjacency entropy threshold;
S56: filter the repeat patterns obtained in the previous step against the Chinese collocation library; a repeat pattern present in the collocation library is filtered out and deleted.
The new word discovery method of the invention uses a system comprising a corpus preprocessing unit, a segmentation unit, a screening and filtering unit, a repeat-pattern construction unit, and a statistics computation unit. The corpus preprocessing unit cleans and saves the corpus; the segmentation unit segments the corpus and tags parts of speech; the screening and filtering unit performs word frequency and part of speech filtering on candidate words; the repeat-pattern construction unit builds the repeat-pattern set from candidate words; the statistics computation unit computes parameters such as the inner coupling degree of each repeat pattern and filters and deletes accordingly.
Compared with the prior art, the present invention discovers new words by combining rule-based and statistical methods and has the following beneficial effects:
First, the segmentation unit segments the experimental corpus with a Chinese word segmentation tool and integrates a user dictionary, guaranteeing segmentation accuracy to the greatest extent and thereby the precision of the discovered new words.
Second, the screening and filtering unit not only builds a stop part-of-speech set but also integrates multiple dictionaries as the background basic dictionary.
Third, the filtering stage combines word frequency, inner coupling degree, left (right) adjacency entropy, left-neighbor right-adjacency entropy, right-neighbor left-adjacency entropy, left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy as judgment criteria, substantially improving the precision of the discovered new words.
Detailed description of the invention
Fig. 1 is a flow chart of the new word discovery method of the invention.
Specific embodiment
It is understandable to enable above-mentioned purpose of the invention, feature and beneficial effect to become apparent, 1 pair of sheet with reference to the accompanying drawing
The specific embodiment of invention is described in detail.
First, the experimental corpus is cleaned according to the corpus cleaning rules, and the corpus is saved item by item, one line per item.
Second, each line is read in order; the microblog corpus is segmented and POS-tagged with the NLPIR tool and a user dictionary, obtaining the POS-tagged corpus.
Then, the frequency of each word in the POS-tagged corpus is counted; according to a preset word frequency threshold, low-frequency words are put into the filter vocabulary and high-frequency words are added to the initial candidate list.
Next, the filter part-of-speech set is built; for each word in the POS-tagged corpus, if its part of speech is in the filter set it is added to the filter vocabulary; otherwise it is added to the initial candidate list, and the method proceeds to the next step.
Example: from the original microblog corpus, take the post: "The term eat soil (吃土) originates from the Double-11 shopping carnival: netizens joke that, having overspent their budget while shopping, they will have to eat soil next month, describing a kind of mania for online shopping." After the first segmentation, the segmentation and POS-tagging result is: eat/v soil/n one/m word/n originates-from/v Double-11/nz shopping/vn carnival/n ，/wd netizen/n in/p shopping/vi 的/ude1 process/n 中/n because/p cost/n over-budget/n self-mock/vi next-month/nz eat/v soil/n ，/wd come/vf describe/v 过/vf towards/p network/n shopping/vi 的/ude1 one/m kind/v madness/a degree/n 。/wj. By word frequency, the low-frequency words in the segmented corpus — "process", "cost", "next month", "describe", "madness" and "degree" — are added to the filter vocabulary. By part of speech, "one", the comma "，", the particles and prepositions (的, 在, 中), "because", "come", "过" and "towards" are added to the filter vocabulary.
S5: construct the repeat-pattern set. If there is a character string to the right of the current candidate word and it is not a punctuation mark, further judge whether the current string is in the filter vocabulary or matches the filter part-of-speech set; if not, combine the current candidate word with the current string to obtain a repeat pattern, and so obtain the repeat-pattern list.
Specifically, loop through the initial candidate list L0 and take an initial candidate word; if the word on its right side is not in the filter vocabulary, superimpose it to obtain repeated string 1 and add it to the repeat-pattern list R; continue superimposing the next right-side word on repeated string 1; if it is not in the filter vocabulary, obtain repeated string 2 and add it to R. The superposition stops when the right-side word is a punctuation mark or a word in the filter vocabulary.
Example: for the segmented corpus, the repeat patterns are constructed as follows. Starting from "eat", "eat soil" is built first; since "one" is in the filter vocabulary, iteration stops. Starting from "word", the patterns "word originates-from", "word originates-from Double-11", "word originates-from Double-11 shopping", "word originates-from Double-11 shopping carnival", "originates-from Double-11", "originates-from Double-11 shopping", "originates-from Double-11 shopping carnival", "Double-11 shopping", "Double-11 shopping carnival" and "shopping carnival" are built; since "，" is in the filter vocabulary, iteration stops. Because the function words as well as "process", "because" and "cost" are in the filter vocabulary, construction resumes at "over-budget", building "over-budget self-mocking", and iteration stops because "next month" is in the filter vocabulary. Starting from "eat", "eat soil" is built again; since "，", "come", "describe" and "towards" are in the filter vocabulary, iteration stops. Starting from "network", "online shopping" is built; since "one", "madness" and "degree" are in the filter vocabulary, iteration stops. The repeat-pattern construction process then ends.
S6: filter the repeat-pattern list obtained in the previous step against the integrated basic dictionary. A repeat pattern present in the basic dictionary is filtered out and deleted; otherwise, proceed to S7.
Example: filtering the repeat patterns against the basic dictionary yields the candidate new words: "eat soil", "word originates-from", "word originates-from Double-11", "word originates-from Double-11 shopping", "word originates-from Double-11 shopping carnival", "originates-from Double-11", "originates-from Double-11 shopping", "originates-from Double-11 shopping carnival", "Double-11 shopping", "Double-11 shopping carnival", "shopping carnival", "over-budget self-mocking" and "online shopping".
S7: compute the word frequency of each repeat pattern. According to a preset word frequency threshold, repeat patterns whose frequency is below the threshold are filtered out and deleted.
Example: by computing word frequencies, the following candidates are filtered out: "word originates-from", "word originates-from Double-11", "word originates-from Double-11 shopping", "word originates-from Double-11 shopping carnival", "originates-from Double-11", "originates-from Double-11 shopping", "originates-from Double-11 shopping carnival", "over-budget self-mocking" and "online shopping".
S8: compute the inner coupling degree of each repeat pattern. Enumerate all substrings of the repeat pattern, compute the inner coupling over the substrings, and obtain the value of the repeat pattern's inner coupling degree by formula (1). According to a preset threshold, repeat patterns below the threshold are filtered out and deleted.
The internal tightness of a word can be measured by its inner coupling degree (Inside Coupling), defined as follows: enumerate all possible binary splits of the word string w, {(w11, w12), (w21, w22) … (wi1, wi2) … (wn1, wn2)} (for example, the binary splits of "Chinese person" 中国人 are (中国 "China", 人 "person") and (中, 国人)); the resulting IC(w) is called the inner coupling degree of w. P(w) denotes the probability that the word string w occurs in the text domain D (the original corpus), computed by formula (2) as P(w) = N(w)/N_D, where N(w) is the number of times w occurs in D and N_D is the total number of words in D. The larger the IC value, the stronger the correlation between the parts of the string and the higher the cohesion of the word; conversely, a smaller IC value indicates lower correlation and lower cohesion.
Example: by computing the inner coupling degree, the following candidates are filtered out: "Double-11 shopping", "Double-11 shopping carnival" and "shopping carnival".
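For illustration, a sketch of an inner-coupling computation. Formula (1) itself is not reproduced in this text, so the sketch uses one common formulation — the probability of the whole string divided by the largest probability product over its binary splits; the exact formula in the patent may differ:

```python
def inner_coupling(word, counts, total):
    """Hypothetical inner-coupling sketch for a string of length >= 2:
    P(w) over the best binary-split product P(w1)*P(w2), with
    P(s) = counts[s] / total as in formula (2)."""
    p = lambda s: counts.get(s, 0) / total
    best = max(p(word[:i]) * p(word[i:]) for i in range(1, len(word)))
    return p(word) / best if best else 0.0
```

A tightly coupled string (its parts rarely occur apart) yields a large value; a loose concatenation of frequent parts yields a small one, matching the IC discussion above.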
S9: count the left adjacent character set and the right adjacent character set of each repeat pattern. Compute the left (right) adjacency entropy of each repeat pattern by formula (3). According to preset left (right) adjacency entropy thresholds, repeat patterns below the threshold are filtered out and deleted.
Let C = {c1, c2, … ci … cn} be the set of all single characters that may appear to the left (right) of the word string w in the text domain D, called the left (right) adjacent word set of w. Formula (3), IE(w) = −Σ_{i=1}^{n} (n_i/n) · lb(n_i/n), computed over C, gives the entropy of the left (right) adjacent word set of w, where n_i is the number of times c_i occurs as a left (right) adjacent word of w, n is the total number of occurrences of all words in C as left (right) adjacent words of w, and lb denotes the base-2 logarithm.
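The adjacency entropy IE(w) = −Σ (n_i/n) · lb(n_i/n) can be computed directly from the list of observed adjacent words (a minimal sketch; the function name is illustrative):

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    """Entropy of a left (or right) adjacent-word set:
    IE(w) = -sum(n_i/n * lb(n_i/n)), lb = base-2 logarithm."""
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A string whose neighbors are diverse on both sides (high entropy) is more likely a free-standing word; a string always preceded or followed by the same character scores near zero.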
S10: count the right adjacent character set of each left adjacent character of the repeat pattern, and the left adjacent character set of each right adjacent character. Compute each word's left-neighbor right-adjacency entropy and right-neighbor left-adjacency entropy by formulas (4) and (5), and each word's left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy by formulas (6) and (7). According to preset left (right) neighbor average right (left) adjacency entropy thresholds, repeat patterns below the average adjacency entropy threshold are filtered out and deleted.
Left-neighbor right-adjacency entropy (formula (4)): xi denotes a left adjacent word of the candidate word, i the number of left adjacent words of the candidate, Gj the right adjacent words of that left adjacent word, and j the number of right adjacent words of the current xi; P denotes the probability, within the candidate's left adjacent word set, that xi occurs as a left adjacent word of the candidate.
Right-neighbor left-adjacency entropy (formula (5)): xi denotes a right adjacent word of the candidate word, i the number of right adjacent words of the candidate, Gj the left adjacent words of that right adjacent word, and j the number of left adjacent words of the current xi; P denotes the probability, within the candidate's right adjacent word set, that xi occurs as a right adjacent word of the candidate.
Left-neighbor average right-adjacency entropy (formula (6)): the average of LRE(xi), the left-neighbor right-adjacency entropy, over the candidate's m left adjacent words.
Right-neighbor average left-adjacency entropy (formula (7)): the average of RLE(xi), the right-neighbor left-adjacency entropy, over the candidate's m right adjacent words.
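Formulas (6) and (7) average the per-neighbor adjacency entropies. A sketch, assuming formulas (4)/(5) reduce to the adjacency entropy of each neighbor's own adjacent-word set (the weighting by P is omitted here, which is an assumption about the unreproduced formulas):

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    # IE = -sum(n_i/n * log2(n_i/n)) over adjacent-word counts
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def average_neighbor_entropy(neighbor_sets):
    # Formulas (6)/(7): average the adjacency entropy of each of the
    # candidate's m neighbors' own adjacent-word sets.
    m = len(neighbor_sets)
    return sum(adjacency_entropy(s) for s in neighbor_sets) / m
```

Intuitively, if the candidate's neighbors are themselves flexible words (each with varied neighbors of its own), the candidate boundary is likely a genuine word boundary.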
S11: filter the remaining repeat patterns against the Chinese collocation library; a repeat pattern present in the collocation library is filtered out and deleted.
S12: the remaining words are taken as new words.
Example: compute the left adjacency entropy, right adjacency entropy, left-neighbor right-adjacency entropy, right-neighbor left-adjacency entropy, left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy for the candidate words not yet filtered in this corpus. The new word finally obtained is "eat soil" (吃土).
Although the disclosure is as above, the present invention is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention; the protection scope of the invention shall therefore be defined by the claims.
Claims (11)
1. A new word discovery method, characterized by comprising the following steps:
S1: clean and save the corpus;
S2: segment the corpus and tag parts of speech;
S3: filter by word frequency and part of speech;
S4: construct the repeat-pattern set;
S5: filter and delete repeat patterns;
S6: the remaining repeat patterns are new words.
2. The new word discovery method according to claim 1, characterized in that cleaning and saving the corpus comprises: cleaning the experimental corpus according to corpus cleaning rules, and saving the corpus item by item, one line per item.
3. The new word discovery method according to claim 2, characterized in that segmenting the corpus and tagging parts of speech comprises: segmenting the microblog corpus and tagging parts of speech with the NLPIR tool and a user dictionary, obtaining the POS-tagged corpus.
4. The new word discovery method according to claim 3, characterized in that word frequency filtering and part of speech filtering comprise the following steps:
S31: count the frequency of each word in the POS-tagged corpus; according to a preset word frequency threshold, put low-frequency words into the filter vocabulary and add high-frequency words to the initial candidate list;
S32: build a filter part-of-speech set; for each word in the POS-tagged corpus, judge whether its part of speech is in the filter set; if it is, add the word to the filter vocabulary, otherwise add it to the initial candidate list.
5. The new word discovery method according to claim 4, characterized in that constructing the repeat-pattern set comprises: loop through the initial candidate list L0 and take an initial candidate word; if the word on its right side is not in the filter vocabulary, superimpose it to obtain repeated string 1 and add it to the repeat-pattern list R; continue superimposing the next right-side word on repeated string 1; if it is not in the filter vocabulary, obtain repeated string 2 and add it to R; the superposition stops when the right-side word is a punctuation mark or a word in the filter vocabulary, yielding the repeat-pattern list.
6. The new word discovery method according to claim 5, characterized in that filtering and deleting repeat patterns comprises the following steps:
S51: filter the obtained repeat-pattern list against the integrated basic dictionary; a repeat pattern present in the basic dictionary is filtered out and deleted;
S52: compute the word frequency of each repeat pattern and, according to a preset word frequency threshold, delete repeat patterns whose frequency is below the threshold.
7. The new word discovery method according to claim 6, characterized in that filtering and deleting repeat patterns further comprises the following steps:
S53: compute the inner coupling degree of each repeat pattern and, according to a preset threshold, delete repeat patterns below the threshold;
S54: count the left adjacent character set and the right adjacent character set of each repeat pattern and, according to preset left (right) adjacency entropy thresholds, filter out and delete repeat patterns below the threshold;
S55: count the right adjacent character set of each left adjacent character and the left adjacent character set of each right adjacent character and, according to preset left (right) neighbor average right (left) adjacency entropy thresholds, filter out and delete repeat patterns below the average adjacency entropy threshold;
S56: filter the repeat patterns obtained in the previous step against the Chinese collocation library; a repeat pattern present in the collocation library is filtered out and deleted.
8. The new word discovery method according to claim 7, characterized in that computing the inner coupling degree of a repeat pattern comprises: enumerate all substrings of the repeat pattern and compute the inner coupling over the substrings, obtaining the value of the repeat pattern's inner coupling degree by formula (1);
the internal tightness of a word can be measured by its inner coupling degree, defined as follows: enumerate all possible binary splits of the word string w, {(w11, w12), (w21, w22) … (wi1, wi2) … (wn1, wn2)}; the resulting IC(w) is called the inner coupling degree of w;
P(w) denotes the probability that the word string w occurs in the text domain D, computed by formula (2) as P(w) = N(w)/N_D, where N(w) is the number of times w occurs in D and N_D is the total number of words in D; the larger the IC value, the stronger the correlation between the parts of the string and the higher the cohesion of the word; conversely, a smaller IC value indicates lower correlation and lower cohesion.
9. The new word discovery method according to claim 8, characterized in that counting the left adjacent character set and the right adjacent character set of a repeat pattern comprises: compute the left (right) adjacency entropy of each repeat pattern by formula (3);
let C = {c1, c2, … ci … cn} be the set of all single characters that may appear to the left (right) of the word string w in the text domain D, called the left (right) adjacent word set of w; formula (3), IE(w) = −Σ_{i=1}^{n} (n_i/n) · lb(n_i/n), computed over C, gives the entropy of the left (right) adjacent word set of w, where n_i is the number of times c_i occurs as a left (right) adjacent word of w, n is the total number of occurrences of all words in C as left (right) adjacent words of w, and lb denotes the base-2 logarithm.
10. The new word discovery method according to claim 9, characterized in that counting the right adjacent character set of each left adjacent character and the left adjacent character set of each right adjacent character comprises: compute each word's left-neighbor right-adjacency entropy and right-neighbor left-adjacency entropy by formulas (4) and (5), and each word's left-neighbor average right-adjacency entropy and right-neighbor average left-adjacency entropy by formulas (6) and (7):
left-neighbor right-adjacency entropy (formula (4)): xi denotes a left adjacent word of the candidate word, i the number of left adjacent words of the candidate, Gj the right adjacent words of that left adjacent word, and j the number of right adjacent words of the current xi; P denotes the probability, within the candidate's left adjacent word set, that xi occurs as a left adjacent word of the candidate;
right-neighbor left-adjacency entropy (formula (5)): xi denotes a right adjacent word of the candidate word, i the number of right adjacent words of the candidate, Gj the left adjacent words of that right adjacent word, and j the number of left adjacent words of the current xi; P denotes the probability, within the candidate's right adjacent word set, that xi occurs as a right adjacent word of the candidate;
left-neighbor average right-adjacency entropy (formula (6)): the average of LRE(xi), the left-neighbor right-adjacency entropy, over the candidate's m left adjacent words;
right-neighbor average left-adjacency entropy (formula (7)): the average of RLE(xi), the right-neighbor left-adjacency entropy, over the candidate's m right adjacent words.
11. A new word discovery system, comprising a corpus preprocessing unit, a word segmentation unit, a screening-and-filtering unit, a repeat-pattern construction unit, and a statistics computation unit; wherein the corpus preprocessing unit cleans and stores the corpus; the word segmentation unit segments the corpus and tags parts of speech; the screening-and-filtering unit applies word-frequency filtering and part-of-speech filtering to candidate words; the repeat-pattern construction unit builds a repeat-pattern set from the candidate words; and the statistics computation unit computes parameters such as the internal coupling degree of each repeat pattern and filters out patterns accordingly.
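As a rough illustration of how the units in claim 11 could chain together, here is a minimal Python sketch; all function names are hypothetical, and a simple character n-gram pass stands in for the patent's POS-tagging segmenter and repeat-pattern construction:

```python
import re
from collections import Counter

def preprocess(corpus):
    """Corpus preprocessing unit (stand-in): strip whitespace noise."""
    return re.sub(r"\s+", "", corpus)

def segment(text, n=2):
    """Segmentation unit (stand-in): emit overlapping character n-grams
    as candidate words; the patent instead segments and tags parts of speech."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def frequency_filter(candidates, min_freq=2):
    """Screening-and-filtering unit: keep candidates at or above a
    frequency threshold (the patent also filters by part of speech)."""
    counts = Counter(candidates)
    return {w: c for w, c in counts.items() if c >= min_freq}

def discover(corpus, min_freq=2):
    """End-to-end sketch: preprocess -> segment -> frequency-filter.
    Repeat-pattern construction and entropy/coupling filtering would follow."""
    return frequency_filter(segment(preprocess(corpus)), min_freq)
```

The surviving candidates would then feed the repeat-pattern and statistics units described above.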
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910519979.6A CN110334345A (en) | 2019-06-17 | 2019-06-17 | New word discovery method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110334345A true CN110334345A (en) | 2019-10-15 |
Family
ID=68141071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910519979.6A Pending CN110334345A (en) | 2019-06-17 | 2019-06-17 | New word discovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110334345A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106528527A (en) * | 2016-10-14 | 2017-03-22 | 深圳中兴网信科技有限公司 | Identification method and identification system for out of vocabularies |
CN108845982A (en) * | 2017-12-08 | 2018-11-20 | 昆明理工大学 | A kind of Chinese word cutting method of word-based linked character |
Non-Patent Citations (1)
Title |
---|
ZHAO Xiaobao et al.: "New word recognition based on an iterative algorithm", Computer Engineering (《计算机工程》) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329443A (en) * | 2020-11-03 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Method, device, computer equipment and medium for determining new words |
CN112329443B (en) * | 2020-11-03 | 2023-07-21 | 中国平安人寿保险股份有限公司 | Method, device, computer equipment and medium for determining new words |
CN113609844A (en) * | 2021-07-30 | 2021-11-05 | 国网山西省电力公司晋城供电公司 | Electric power professional word bank construction method based on hybrid model and clustering algorithm |
CN113609844B (en) * | 2021-07-30 | 2024-03-08 | 国网山西省电力公司晋城供电公司 | Electric power professional word stock construction method based on hybrid model and clustering algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110781317B (en) | Method and device for constructing event map and electronic equipment | |
CN108132927B (en) | Keyword extraction method for combining graph structure and node association | |
CN104239539B (en) | A kind of micro-blog information filter method merged based on much information | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN104077417B (en) | People tag in social networks recommends method and system | |
CN110457404B (en) | Social media account classification method based on complex heterogeneous network | |
CN105045875B (en) | Personalized search and device | |
CN103812872B (en) | A kind of network navy behavioral value method and system based on mixing Di Li Cray process | |
CN110110094A (en) | Across a network personage's correlating method based on social networks knowledge mapping | |
TWI443529B (en) | Methods and systems for automatically constructing domain phrases, and computer program products thereof | |
CN107835113A (en) | Abnormal user detection method in a kind of social networks based on network mapping | |
CN108694647A (en) | A kind of method for digging and device of trade company's rationale for the recommendation, electronic equipment | |
CN110992059B (en) | Surrounding string behavior recognition analysis method based on big data | |
CN101980199A (en) | Method and system for discovering network hot topic based on situation assessment | |
CN102945246B (en) | The disposal route of network information data and device | |
CN104484343A (en) | Topic detection and tracking method for microblog | |
CN105095419A (en) | Method for maximizing influence of information to specific type of weibo users | |
CN109685153A (en) | A kind of social networks rumour discrimination method based on characteristic aggregation | |
CN108932669A (en) | A kind of abnormal account detection method based on supervised analytic hierarchy process (AHP) | |
CN109582714B (en) | Government affair item data processing method based on time attenuation association | |
CN112084373B (en) | Graph embedding-based multi-source heterogeneous network user alignment method | |
CN103631862B (en) | Event characteristic evolution excavation method and system based on microblogs | |
CN110334345A (en) | New word discovery method | |
CN104239321B (en) | A kind of data processing method and device of Search Engine-Oriented | |
KR101224312B1 (en) | Friend recommendation method for SNS user, recording medium for the same, and SNS and server using the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191015 |