CN106815187A - An efficient new term recognition system and method - Google Patents

An efficient new term recognition system and method

Info

Publication number
CN106815187A
Authority
CN
China
Prior art keywords
newterm
new terminology
result
text
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510845390.7A
Other languages
Chinese (zh)
Other versions
CN106815187B (en)
Inventor
符建辉
王卫明
曹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Original Assignee
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority to CN201510845390.7A
Publication of CN106815187A
Application granted
Publication of CN106815187B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The present invention relates to an efficient new term recognition system and method. The system comprises: a text word sequence module A, which segments every document in the input text corpus RCorpus to form text word sequences; a new term recognition module B, which performs new term recognition on the word sequence of every document in the segmented text corpus TCorpus; and a verification module C, which verifies the recognized new terms. The method comprises the following steps. Step 1: the text word sequence module A segments every text in the input text corpus RCorpus to form text word sequences. Step 2: the new term recognition module B performs new term recognition on every text word sequence in the segmented text corpus TCorpus. Step 3: the verification module C verifies the recognized new terms. The invention provides a new term recognition method and system with high precision and high recall. The recognition precision for new terms is 93.8%.

Description

An efficient new term recognition system and method
Technical field
The present invention relates to Chinese natural language processing and the automatic recognition of new Chinese words, and more particularly to a system and method for automatically recognizing new terms.
Background
With the rapid development of the Internet, new terms of all kinds emerge continually, which creates great difficulty for natural language processing applications, automated software (such as word segmentation systems), and dictionary compilation.
New term recognition has been studied for many years. Existing methods fall into three classes. The first class is statistics-based. For example, Kenneth Ward Church, Béatrice Daille and others use mutual information (Mutual Information) to extract fixed combinations and collocations of words: they hold that adjacent character combinations that frequently co-occur are generally terms, and they use mutual information to measure the degree of co-occurrence of a phrase. As another example, Ted Dunning, Jonathan D. Cohen and others use the log-likelihood ratio (Log-Likelihood Ratio) to address the recognition of low-frequency words, and demonstrate the validity of this method both theoretically and empirically. Statistical methods also include maximum matching, hidden Markov models, maximum entropy, and so on. The second class is based on linguistic features and lexical patterns. For example, Liu Lei, Wang Shi and Tian Guogang combine multiple features with morphological and syntactic patterns to acquire new technical terms. The third class combines the first two and thereby overcomes their respective shortcomings.
However, detailed experimental analysis shows that the above methods have the following two problems.
Problem 1: new term recognition precision. Purely statistical methods can recognize more new terms, but they usually introduce a large number of errors; that is, Chinese character strings that are not new terms are misjudged as new terms. For example, in the sentence "the headquarters organizes cadres to study the spirit of the Central Committee", a statistical method easily misrecognizes "headquarters organizes cadres", "organizes cadres", "cadres study", and so on as new terms, although they clearly are not. Conversely, when recognition precision is required to be very high, recognition coverage is restricted. This is one of the key problems the present invention addresses.
Problem 2: new term recognition coverage. Because words can combine in a great many ways, automatic recognition easily misses new, meaningful terms. How to improve recognition coverage is therefore an important problem, and it is also one of the key problems the present invention addresses.
Summary of the invention
The technical problems to be solved by the invention are new term recognition precision and new term recognition coverage.
For Problem 1, the invention introduces a seed term dictionary technique: the seed term dictionary is used not only to recognize new terms but also to verify newly acquired terms.
For Problem 2, the invention introduces a multi-source, iterative new term recognition technique. First, multi-source analysis is used: a candidate is verified against multiple texts, which raises recognition precision. Second, the acquired new terms are added to the seed term dictionary and the process is repeated, so that more new terms are obtained.
To achieve the above objects, the invention provides the following technical solution. An efficient new term recognition system, characterized in that it comprises: a text word sequence module A, which segments every document in the input text corpus RCorpus to form text word sequences; a new term recognition module B, which performs new term recognition on the word sequence of every document in the segmented text corpus TCorpus; and a verification module C, which verifies the recognized new terms;
Among these modules, module A segments every document in the input text corpus RCorpus to form segmented text word sequences, thereby producing the segmented text corpus TCorpus for use by the new term recognition module B; module B performs new term recognition on every document in TCorpus and produces a set of new term results to be verified, for use by the verification module C; module C further verifies the new terms recognized by module B.
An efficient new term recognition method, characterized in that it comprises the following steps:
Step 1: The text word sequence module A segments every text in the input text corpus RCorpus to form text word sequences;
Every input text D in RCorpus is segmented using the open-source ICTCLAS system. The segmentation result is $T' = W_1/pos_1\ W_2/pos_2 \ldots W_i/pos_i \ldots W_n/pos_n$, where each $W_i$ is a Chinese word, a Chinese character, a punctuation mark, an Arabic numeral, or an English word or letter, and $pos_i$ is its part of speech;
To distinguish the two, the collection of texts produced by segmenting every text in RCorpus is denoted TCorpus;
Step 2: The new term recognition module B performs new term recognition on every text word sequence in the segmented text corpus TCorpus;
Let the text currently being processed be $D_i$, with title $T_i$, and let $S_{ij}$ be the $j$-th sentence of $D_i$ currently being processed. $S_{ij}$ is processed by the following steps to form candidate new term results, which are stored in the set tmp_result:
Step B1: Set tmp_result to empty;
tmp_result stores the recognized new term results and is passed to the verification module C for verification. The new term results in tmp_result are therefore called candidate new term results, or new term results to be verified;
Step B2: The longest contiguous run of words in $S_{ij}$ whose parts of speech are tagged a, b, j, n, m, or q forms a candidate new term, denoted NewTerm. "Longest contiguous" means that no word with part of speech a, b, j, or n is adjacent to either end of NewTerm in $S_{ij}$;
Step B3: If the word W immediately following NewTerm in $S_{ij}$ has part of speech k (that is, W may be a suffix of NewTerm), set NewTerm = NewTerm ⊕ W;
Step B4: If the word W located before NewTerm in $S_{ij}$ has part of speech h (that is, W may be a prefix of NewTerm), set NewTerm = W ⊕ NewTerm;
Step B5: Put (NewTerm, $T_i$, $S_{ij}$) into tmp_result;
Step 3: The verification module C verifies the recognized new terms;
The main task of the verification module C is to verify the new terms in tmp_result produced by the new term recognition module B, using multi-source verification and special verification methods; verified new terms are put into the set result. Module C works as follows:
Step C1: Set result to empty;
Step C2: For every triple (NewTerm, $T_i$, $S_{ij}$) in tmp_result, perform steps C3, C4, and C5 below;
Step C3: If tmp_result contains (NewTerm, $T_i'$, $S_{ij}'$) with $T_i'$ different from $T_i$ (that is, NewTerm appears in two different texts in TCorpus), put NewTerm into result; otherwise, perform step C4;
As step C3 indicates, although NewTerm has been recognized as a candidate new term in sentence $S_{ij}$ of the text titled $T_i$, it is not necessarily a correct new term. If, however, it is also recognized as a new term in sentence $S_{ij}'$ of the text titled $T_i'$, the likelihood that NewTerm is a correct new term rises greatly;
Step C4: If the seed term dictionary contains a seed term Term such that the weighted similarity wsim(NewTerm, Term) > α, where α ∈ [0, 1] is a threshold, put NewTerm into result; otherwise, perform step C5;
To compute the weighted similarity wsim(NewTerm, Term) of two terms, we first define the function 2gram. For a non-empty Chinese character string $Sent = C_1C_2 \ldots C_{i-1}C_i \ldots C_{K-1}C_K$, where each $C_i$ is a Chinese character, a digit, or an English letter, we introduce the string with head and tail markers $\$C_1C_2 \ldots C_{i-1}C_i \ldots C_{K-1}C_K\$$. 2gram(Sent) is the set of all pairs of consecutive characters of the marked string, taken from left to right, that is, $\mathrm{2gram}(Sent) = \{\$C_1,\ C_1C_2,\ \ldots,\ C_{K-1}C_K,\ C_K\$\}$;
It should be noted that the elements of 2gram(Sent) are not equally important: when $C_{i-1}C_i$ is itself a Chinese word, it plays a larger role in 2gram(Sent). To reflect the importance of each element of 2gram(Sent), the intersection Interset($S_1$, $S_2$) defined in this description is refined by introducing a new cardinality, called the weighted intersection cardinality WInterset($S_1$, $S_2$). For two given sets $S_1$ and $S_2$, it is computed as follows:
(1) Initialize WInterset($S_1$, $S_2$) = 0;
(2) For each element e of Interset($S_1$, $S_2$): if e is a Chinese word, then WInterset($S_1$, $S_2$) = WInterset($S_1$, $S_2$) + 1.2, that is, 1.2 is added rather than 1; otherwise WInterset($S_1$, $S_2$) = WInterset($S_1$, $S_2$) + 1, that is, 1 is added;
wsim(NewTerm, Term) is computed as follows:
(1) If NewTerm and Term have the same prefix and the same suffix, wsim(NewTerm, Term) = 1;
(2) If NewTerm and Term do not have the same prefix and suffix, $$\mathrm{wsim}(\mathrm{NewTerm}, \mathrm{Term}) = \frac{\mathrm{WInterset}(\mathrm{2gram}(\mathrm{NewTerm}),\ \mathrm{2gram}(\mathrm{Term}))}{\left|\mathrm{Union}(\mathrm{2gram}(\mathrm{NewTerm}),\ \mathrm{2gram}(\mathrm{Term}))\right|};$$
Step C5: Verify NewTerm using its context in $S_{ij}$. Specifically, when the part of speech of the word before NewTerm in $S_{ij}$ is one of c, d, p, r, u, z, and the part of speech of the word after NewTerm in $S_{ij}$ is one of c, d, p, r, u, z, NewTerm is a correct new term and is added to result; otherwise it is discarded, that is, not added to result;
Step C6: Output result as the final result.
Beneficial effects: The invention provides a new term recognition method and system with high precision and high recall. It was tested on up to 2 GB of web page corpus which, in addition to news, covers a variety of industries and professional domains. The recognition precision for new terms is 93.8%. The invention therefore achieves good recognition performance, is suitable for practical use, and lays a solid foundation for applications such as dictionary compilation, word segmentation, text classification, public opinion analysis, and advertisement analysis.
Brief description of the drawings
Fig. 1 is a workflow diagram of the new term recognition system of the invention;
Fig. 2 is a schematic diagram of the working method of module B;
Fig. 3 is a schematic diagram of the working method of module C.
Detailed description of the embodiments
To explain the invention more clearly, the following terms are defined and explained:
(1) Word length: the length of a word. A Chinese word consists of one or more Chinese characters, and its length equals the number of characters it contains. A word of length 1 is called a one-character word, a word of length 2 a two-character word, a word of length 3 a three-character word, and so on.
(2) Multi-character word: a word with a definite meaning that consists of 3 or more Chinese characters, for example "in Common spirit" (a four-character word in the original Chinese) and "positive energy" (a three-character word in the original Chinese).
(3) Dictionary: a list of words. Its entries may be one-character words (length 1), two-character words (length 2), or multi-character words.
(4) New term: given a term dictionary, a term that does not appear in that dictionary is called a new term, also called an unlisted term.
(5) Text word sequence, sentence word sequence: segmenting a text yields a sequence of words, called a text word sequence, or word sequence for short. When the text is a single sentence, the result is sometimes called a sentence word sequence. Where the context is clear, a sentence word sequence is also simply called a word sequence.
(6) ICTCLAS system: a free, open-source word segmentation system. It takes a text as input and outputs the segmented sequence of that text. The ICTCLAS download address is http://ictclas.nlpir.org. After segmentation, each token is tagged with its part of speech, where a denotes adjective, b distinguishing word, c conjunction, d adverb, h prefix, j abbreviation, k suffix, m numeral, n noun, p preposition, q measure word, r pronoun, u auxiliary word, z descriptive word, and so on.
(7) Text: whether it is a short sentence or a long article, we refer to it as a text. For simplicity, every web page mentioned in this description refers, unless otherwise explained, to the plain text obtained after removing HTML tags, CSS code (Cascading Style Sheets), DIV code (the positioning technique used in CSS, from the English "Division"), and JS code (JavaScript).
(8) String concatenation: given any two character strings $V_1$ and $V_2$, their concatenation is the new string obtained by joining them together seamlessly, denoted $V_1 \oplus V_2$. For example, if $V_1$ = "general" and $V_2$ = "analysis", then $V_1 \oplus V_2$ = "generalanalysis".
(9) Intersection, union, and cardinality of sets: given two sets $S_1$ and $S_2$, their intersection, denoted Interset($S_1$, $S_2$), is the set of elements that appear in $S_1$ and also appear in $S_2$. For example, if $S_1$ = {notebook, computer} and $S_2$ = {notebook, computer, 4G}, then Interset($S_1$, $S_2$) = {notebook, computer}. The set of elements that appear in $S_1$ or in $S_2$ is the union of $S_1$ and $S_2$, denoted Union($S_1$, $S_2$). For example, if $S_1$ = {notebook, computer} and $S_2$ = {notebook, computer, 4G}, then Union($S_1$, $S_2$) = {notebook, computer, 4G}. The number of elements in a set is its cardinality; the cardinality of $S_1$ is written $|S_1|$, for example $S_1$ = {notebook, computer} gives $|S_1| = 2$.
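The set operations in definition (9) correspond directly to built-in set operations in most programming languages. As a small illustration (Python is our choice here; the description does not prescribe an implementation language), the example sets above can be written as:

```python
# Example sets taken from definition (9).
s1 = {"notebook", "computer"}
s2 = {"notebook", "computer", "4G"}

interset = s1 & s2      # Interset(S1, S2): elements in both sets
union = s1 | s2         # Union(S1, S2): elements in either set
cardinality = len(s1)   # |S1|: number of elements in S1
```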
The invention is described in more detail below with reference to the drawings and specific embodiments.
The efficient new term recognition system and method proposed by the invention take a text collection RCorpus (called the input text corpus) as input and operate using the following three modules:
Module A: segments every text in the input text corpus RCorpus to form text word sequences.
Module B: performs new term recognition on every text word sequence in the segmented text corpus TCorpus.
Module C: verifies the recognized new terms.
Module A: segments every text in the input text corpus RCorpus to form text word sequences.
Every input text D in RCorpus is segmented using the open-source ICTCLAS system. The segmentation result is $T' = W_1/pos_1\ W_2/pos_2 \ldots W_i/pos_i \ldots W_n/pos_n$, where each $W_i$ is a Chinese word, a Chinese character, a punctuation mark, an Arabic numeral, or an English word or letter, and $pos_i$ is its part of speech.
Part-of-speech tagging during segmentation is common practice in computing. Common parts of speech include n (noun), v (verb), a (adjective), d (adverb), and p (preposition). For example, the segmentation result of the sentence "the headquarters organizes cadres to study the spirit of the Central Committee" is "headquarters/n organize/n cadres/n study/v Central-Committee/n spirit/n".
To distinguish the two, the collection of texts produced by segmenting every text in RCorpus is denoted TCorpus.
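As a rough illustration of the `W/pos` output format, the sketch below parses one segmented line into (word, tag) pairs. It assumes that tokens are separated by whitespace; the function name `parse_tagged` and the sample line are hypothetical and are not part of ICTCLAS.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

def parse_tagged(line: str) -> List[Token]:
    """Parse 'word/pos word/pos ...' into a list of (word, pos) pairs."""
    tokens: List[Token] = []
    for item in line.split():
        # Split on the last '/', so a word that itself contains '/' stays whole.
        word, sep, pos = item.rpartition("/")
        if not sep:            # no tag present: keep the word, leave the tag empty
            word, pos = item, ""
        tokens.append((word, pos))
    return tokens

# Hypothetical segmented line in the W/pos format.
sample = "headquarters/n organize/n cadres/n study/v spirit/n"
tokens = parse_tagged(sample)
```

Keeping the tag next to each word is what lets the later steps select runs of words by part of speech.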
Module B: performs new term recognition on every text word sequence in the segmented text corpus TCorpus.
Module B is implemented as follows:
Assume the text currently being processed is $D_i$, with title $T_i$, and that $S_{ij}$ is the $j$-th sentence of $D_i$ currently being processed. $S_{ij}$ is processed by the following steps to form candidate new term results (also called new term results to be verified), which are stored in the set tmp_result:
Step B1: Set tmp_result to empty.
tmp_result stores the recognized new term results and is passed to the verification module C for verification. The new term results in tmp_result are therefore called candidate new term results (also called new term results to be verified).
Step B2: The longest contiguous run of words in $S_{ij}$ whose parts of speech are tagged a, b, j, n, m, or q forms a candidate new term, denoted NewTerm.
"Longest contiguous" in step B2 means that no word with part of speech a, b, j, or n is adjacent to either end of NewTerm in $S_{ij}$.
Step B3: If the word W immediately following NewTerm in $S_{ij}$ has part of speech k (that is, W may be a suffix of NewTerm), set NewTerm = NewTerm ⊕ W.
Step B4: If the word W located before NewTerm in $S_{ij}$ has part of speech h (that is, W may be a prefix of NewTerm), set NewTerm = W ⊕ NewTerm.
Step B5: Put (NewTerm, $T_i$, $S_{ij}$) into tmp_result.
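Steps B1 to B5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: runs are taken over all six tags a, b, j, n, m, q; every run, including a single word, becomes a candidate; and no filtering against an existing dictionary is shown. The function name `extract_candidates`, the sample tokens, and their tags are invented for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)
CANDIDATE_TAGS = {"a", "b", "j", "n", "m", "q"}

def extract_candidates(tokens: List[Token]) -> List[str]:
    """Return candidate terms from one tagged sentence (Steps B2 to B4)."""
    candidates: List[str] = []
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i][1] not in CANDIDATE_TAGS:
            i += 1
            continue
        start = i
        while i < n and tokens[i][1] in CANDIDATE_TAGS:   # B2: extend the run
            i += 1
        term = "".join(word for word, _ in tokens[start:i])
        if i < n and tokens[i][1] == "k":                 # B3: append suffix word
            term = term + tokens[i][0]
        if start > 0 and tokens[start - 1][1] == "h":     # B4: prepend prefix word
            term = tokens[start - 1][0] + term
        candidates.append(term)
    return candidates

# Hypothetical tagged sentence; the tags are assigned by hand for illustration.
tokens = [("非", "h"), ("线性", "n"), ("规划", "n"), ("是", "v"), ("现代", "n"), ("化", "k")]
title, sentence = "doc-1", "非线性规划是现代化"

tmp_result = []                                           # B1
for term in extract_candidates(tokens):
    tmp_result.append((term, title, sentence))            # B5
```

Storing the title and sentence with each candidate is what allows module C to check whether the same candidate recurs in a different text and to inspect its context.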
Module C: verifies the recognized new terms.
The main task of this module is to verify the terms in tmp_result produced by module B, using multi-source verification and special verification methods; verified new terms are put into the set result. Module C works as follows:
Step C1: Set result to empty.
Step C2: For every triple (NewTerm, $T_i$, $S_{ij}$) in tmp_result, perform steps C3, C4, and C5 below.
Step C3: If tmp_result contains (NewTerm, $T_i'$, $S_{ij}'$) with $T_i'$ different from $T_i$ (that is, NewTerm appears in two different texts in TCorpus), put NewTerm into result; otherwise, perform step C4.
As step C3 indicates, although NewTerm has been recognized as a candidate new term in sentence $S_{ij}$ of the text titled $T_i$, it is not necessarily a correct new term. If, however, it is also recognized as a new term in sentence $S_{ij}'$ of the text titled $T_i'$, the likelihood that NewTerm is a correct new term rises greatly. (Of course, if a third different text were also required to support NewTerm, the likelihood that NewTerm is correct would be higher still, but this would reduce recognition coverage; experiments show that verification by a third text is unnecessary.)
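The multi-source check of step C3 amounts to counting the distinct titles under which a candidate was found. A minimal sketch, assuming tmp_result is a list of (NewTerm, title, sentence) triples and that titles identify texts uniquely; the function name and sample data are hypothetical:

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Candidate = Tuple[str, str, str]  # (NewTerm, title T_i, sentence S_ij)

def multi_source_verified(tmp_result: Iterable[Candidate]) -> Set[str]:
    """Return the candidates that occur under at least two different titles."""
    titles_by_term = defaultdict(set)
    for term, title, _sentence in tmp_result:
        titles_by_term[term].add(title)
    return {term for term, titles in titles_by_term.items() if len(titles) >= 2}

# Hypothetical candidates: only the first term appears in two different texts.
sample = [
    ("云计算", "Title A", "s1"),
    ("云计算", "Title B", "s2"),
    ("优势", "Title A", "s3"),
    ("优势", "Title A", "s4"),
]
verified = multi_source_verified(sample)
```

Requiring two titles rather than three follows the remark above that a third supporting text is unnecessary.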
Step C4: If the seed term dictionary contains a seed term Term such that the weighted similarity wsim(NewTerm, Term) > α (where α ∈ [0, 1] is a threshold), put NewTerm into result; otherwise, perform step C5.
To compute the weighted similarity wsim(NewTerm, Term) of two terms, we first define the function 2gram. For a non-empty Chinese character string $Sent = C_1C_2 \ldots C_{i-1}C_i \ldots C_{K-1}C_K$, where each $C_i$ is a Chinese character, a digit, or an English letter, we introduce the string with head and tail markers $\$C_1C_2 \ldots C_{i-1}C_i \ldots C_{K-1}C_K\$$. 2gram(Sent) is the set of all pairs of consecutive characters of the marked string, taken from left to right, that is, $\mathrm{2gram}(Sent) = \{\$C_1,\ C_1C_2,\ \ldots,\ C_{K-1}C_K,\ C_K\$\}$.
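A direct transcription of the 2gram definition might look like the sketch below. The function name `two_gram` is ours, and "$" is used as the head and tail marker as in the text.

```python
from typing import Set

def two_gram(sent: str) -> Set[str]:
    """Return the set of adjacent character pairs of '$' + sent + '$'."""
    if not sent:
        raise ValueError("sent must be a non-empty string")
    marked = "$" + sent + "$"          # add head and tail markers
    return {marked[i:i + 2] for i in range(len(marked) - 1)}

grams = two_gram("正能量")
```

A string of K characters yields at most K + 1 pairs; repeated pairs collapse because the result is a set.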
It should be noted that the elements of 2gram(Sent) are not equally important: when $C_{i-1}C_i$ is itself a Chinese word, it plays a larger role in 2gram(Sent). To reflect the importance of each element of 2gram(Sent), the invention refines the previously defined Interset($S_1$, $S_2$) by introducing a new cardinality, called the weighted intersection cardinality WInterset($S_1$, $S_2$). For two given sets $S_1$ and $S_2$, it is computed as follows:
(1) Initialize WInterset($S_1$, $S_2$) = 0;
(2) For each element e of Interset($S_1$, $S_2$): if e is a Chinese word, then WInterset($S_1$, $S_2$) = WInterset($S_1$, $S_2$) + 1.2 (that is, 1.2 is added rather than 1); otherwise, if e is not a Chinese word, WInterset($S_1$, $S_2$) = WInterset($S_1$, $S_2$) + 1 (that is, 1 is added);
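The weighted intersection cardinality can be sketched as below. Deciding whether an element "is a Chinese word" requires some word list; here a hypothetical `lexicon` set stands in for it, since the description does not say which dictionary is consulted.

```python
from typing import Set

def weighted_intersection(s1: Set[str], s2: Set[str],
                          lexicon: Set[str], word_weight: float = 1.2) -> float:
    """WInterset(S1, S2): shared elements count 1.2 if they are words, else 1."""
    total = 0.0                                   # rule (1)
    for e in s1 & s2:                             # rule (2): each shared element
        total += word_weight if e in lexicon else 1.0
    return total

# Hypothetical 2-gram sets and a one-entry word list.
g1 = {"$正", "正能", "能量", "量$"}
g2 = {"$负", "负能", "能量", "量$"}
score = weighted_intersection(g1, g2, lexicon={"能量"})
```

Here the two sets share "能量" (in the word list, weight 1.2) and "量$" (weight 1).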
wsim(NewTerm, Term) is computed as follows:
(1) If NewTerm and Term have the same prefix and the same suffix, wsim(NewTerm, Term) = 1;
(2) If NewTerm and Term do not have the same prefix and suffix, $$\mathrm{wsim}(\mathrm{NewTerm}, \mathrm{Term}) = \frac{\mathrm{WInterset}(\mathrm{2gram}(\mathrm{NewTerm}),\ \mathrm{2gram}(\mathrm{Term}))}{\left|\mathrm{Union}(\mathrm{2gram}(\mathrm{NewTerm}),\ \mathrm{2gram}(\mathrm{Term}))\right|}$$
To illustrate the first case, consider an example. Let NewTerm be "the fourth meeting of the 18th CPC National Congress" and Term be "the CPC Luochuan Meeting". Then wsim(NewTerm, Term) = 1, because in the original Chinese Term and NewTerm share a common prefix (the word for "CPC") and a common suffix (the word for "meeting").
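Both cases of wsim can be combined in one function, sketched below. The description does not define how long a shared prefix or suffix must be, so comparing the first and last `affix_len` characters (default 2) is purely our assumption, as are the `lexicon` word list and the sample strings; the two Chinese strings in the first example are our back-translation of the example above.

```python
from typing import Set

def two_gram(sent: str) -> Set[str]:
    """Adjacent character pairs of '$' + sent + '$'."""
    marked = "$" + sent + "$"
    return {marked[i:i + 2] for i in range(len(marked) - 1)}

def wsim(new_term: str, term: str, lexicon: Set[str], affix_len: int = 2) -> float:
    """Weighted similarity of two terms (cases (1) and (2))."""
    # Case (1): same prefix and same suffix (assumed: first/last affix_len characters).
    if (new_term[:affix_len] == term[:affix_len]
            and new_term[-affix_len:] == term[-affix_len:]):
        return 1.0
    # Case (2): weighted intersection cardinality over the size of the union.
    g1, g2 = two_gram(new_term), two_gram(term)
    weighted = sum(1.2 if e in lexicon else 1.0 for e in g1 & g2)
    return weighted / len(g1 | g2)

same_affixes = wsim("中共十八大第四次会议", "中共洛川会议", lexicon=set())
partial = wsim("笔记本电脑", "平板电脑", lexicon={"电脑"})
```

In step C4 a candidate would be accepted when this value exceeds the threshold α for some seed term.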
Step C5: Verify NewTerm using its context in $S_{ij}$. Specifically, when the part of speech of the word before NewTerm in $S_{ij}$ is one of c, d, p, r, u, z, and the part of speech of the word after NewTerm in $S_{ij}$ is one of c, d, p, r, u, z, NewTerm is a correct new term and is added to result; otherwise it is discarded, that is, not added to result.
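The context test of step C5 only inspects the tags of the two neighbouring words. In the sketch below the candidate occupies `tokens[start:end]`; treating a candidate at the very start or end of the sentence as failing the test is our assumption, since the description does not cover that case. The function name and sample tokens are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)
CONTEXT_TAGS = {"c", "d", "p", "r", "u", "z"}

def context_verified(tokens: List[Token], start: int, end: int) -> bool:
    """True if the words just before and just after tokens[start:end]
    both carry one of the tags c, d, p, r, u, z."""
    if start == 0 or end >= len(tokens):
        return False  # assumption: a missing neighbour fails the check
    return (tokens[start - 1][1] in CONTEXT_TAGS
            and tokens[end][1] in CONTEXT_TAGS)

# Hypothetical tagged sentence; tags are assigned by hand for illustration.
tokens = [("因为", "c"), ("云", "n"), ("计算", "n"), ("的", "u"), ("优势", "n")]
inner_ok = context_verified(tokens, 1, 3)   # candidate "云计算"
edge_ok = context_verified(tokens, 4, 5)    # candidate "优势" at sentence end
```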
Step C6: Output result as the final result.
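Taken together, steps C1 to C6 form a cascade in which each candidate is accepted by the first check that succeeds. The sketch below shows only that control flow: the seed-similarity and context checks are passed in as callables, and the names `verify`, `similar_to_seed`, and `context_ok` are hypothetical stand-ins for the computations described above.

```python
from collections import defaultdict
from typing import Callable, Iterable, Set, Tuple

Candidate = Tuple[str, str, str]  # (NewTerm, title T_i, sentence S_ij)

def verify(tmp_result: Iterable[Candidate],
           similar_to_seed: Callable[[str], bool],
           context_ok: Callable[[str, str], bool]) -> Set[str]:
    """Run the C3 -> C4 -> C5 cascade over all candidates."""
    candidates = list(tmp_result)
    titles = defaultdict(set)
    for term, title, _ in candidates:
        titles[term].add(title)

    result: Set[str] = set()                      # C1
    for term, title, sentence in candidates:      # C2
        if len(titles[term]) >= 2:                # C3: seen in two different texts
            result.add(term)
        elif similar_to_seed(term):               # C4: wsim(term, seed) > alpha
            result.add(term)
        elif context_ok(term, sentence):          # C5: neighbouring tags check
            result.add(term)
    return result                                 # C6

# Hypothetical data and stand-in checks.
sample = [("云计算", "A", "s1"), ("云计算", "B", "s2"),
          ("区块链", "A", "s3"), ("大数据", "C", "s4"), ("的确", "C", "s5")]
accepted = verify(sample,
                  similar_to_seed=lambda t: t == "区块链",
                  context_ok=lambda t, s: s == "s4")
```

Ordering the checks this way means the cheaper multi-source test runs first and the similarity computation is only needed for candidates found in a single text.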
Experimental results
The invention proposes an efficient method and system for recognizing multi-character new terms. It was tested on up to 2 GB of web page corpus which, in addition to news, covers a variety of industries and professional domains. When computing the similarity wsim(NewTerm, Term) between a new term and the seed terms in the dictionary, repeated experiments show that results are best at α = 0.6, at which the recognition precision for multi-character new terms is 93.8%. The invention therefore achieves good recognition performance, is suitable for practical use, and lays a solid foundation for applications such as dictionary compilation, word segmentation, text classification, public opinion analysis, and advertisement analysis.

Claims (2)

1. An efficient new term recognition system, characterized in that it comprises: a text word sequence module A, which segments every document in the input text corpus RCorpus to form text word sequences; a new term recognition module B, which performs new term recognition on the word sequence of every document in the segmented text corpus TCorpus; and a verification module C, which verifies the recognized new terms;
Among these modules, module A segments every document in the input text corpus RCorpus to form segmented text word sequences, thereby producing the segmented text corpus TCorpus for use by the new term recognition module B; module B performs new term recognition on every document in TCorpus and produces a set of new term results to be verified, for use by the verification module C; module C further verifies the new terms recognized by module B.
2. a kind of efficient new terminology recognition methods, it is characterised in that:Comprise the following steps:
The first step:Text word sequence modules A carries out participle to every text in input text library RCorpus, forms text word Sequence;
We carry out participle, word segmentation result using an ICTCLAS system increased income to every input text D in RCorpus It is T '=W1/pos1W2/pos2…Wi/posi…Wn/posn, wherein each WiBe a Chinese word, Chinese character, punctuation mark, Arabic numerals, English word or letter, posiIt is its corresponding part of speech;
In order to represent difference, every text text in RCorpus by after participle, produced text, we are designated as TCorpus;
Second step:New terminology identification module B carries out new art to every text word sequence in the text library TCorpus after participle Language is recognized;
Current text to be identified is Di, TiIt is its title, SijIt is DiJ-th strip sentence current to be identified;To SijCarry out with The treatment of lower step, forms the new terminology result of candidate, is stored in set tmp_result:
Step B1:It is sky to set tmp_result;
Tmp_result is used to deposit the new terminology result for identifying, passes to authentication module C and is verified.Therefore, New terminology result in tmp_result is also referred to as the new terminology result of candidate, new terminology result also referred to as to be verified;
Step B2:By SijIn continuous most long, part of speech labeled as a, b, j, n, m, q morphology into candidate's new terminology, It is designated as NewTerm;" continuously most long ", refers in SijThe two ends of middle NewTerm do not have the word that part of speech is a, b, j, n;
Step B3:If in SijIn the part of speech of the and then word W of NewTerm be k, i.e. W is probably the suffix of NewTerm, NewTerm=NewTerm ⊕ W are then set;
Step B4:If in SijIn the part of speech of word W that is located at before NewTerm be h, i.e. after W is probably NewTerm Sew, then NewTerm=W ⊕ NewTerm are set;
Step B5:By (NewTerm, Ti, Sij) be put into tmp_result;
3rd step:Authentication module C is verified to the new terminology for recognizing;
The groundwork of authentication module C is, using multi-source proof method, special proof method, new terminology identification module B to be produced New terminology in tmp_result is verified that authenticated new terminology is put into set result;The method of authentication module C It is as follows:
Step C1:It is sky to set result;
Step C2:Every a pair (NewTerm, T in tmp_resulti, Sij) circulate and be following steps C3, C4 and C5;
Step C3:If there is (NewTerm, T in tmp_resulti', Sij'), and TiWith Ti' difference " i.e. NewTerm In appearing in two different texts in TCorpus ", then NewTerm is put into result;Otherwise, step is performed C4;
As described in above-mentioned step C3, although NewTerm is in entitled TiSentence SijIn be identified as candidate's new terminology, but It is that NewTerm might not be exactly a correct new terminology;But, in entitled Ti' sentence Sij' in be also identified as it is new Term, then NewTerm is that the possibility of correct new terminology can be greatly promoted;
Step C4:If there is a seed term Term in seed dictionary so that NewTerm is similar to the weighting of Term Degree wsim (NewTerm, Term)>α, wherein α ∈ [0,1] are a threshold value), then NewTerm is put into result;It is no Then, step C5 is performed;
To provide two calculating of the Weighted Similarity wsim (NewTerm, Term) of term, we first provide function 2gram's Computational methods;To a non-NULL Chinese character string Sent=C1C2…Ci-1Ci…CK-1CK, wherein CiIt is Chinese character, numeral, English alphabet, I Introduce one take the lead tail tag note Chinese character string Sent=$ C1C2…Ci-1Ci…CK-1CK$;2gram (Sent) be one by Sent from The set that right continuous two characters of left-hand are constituted, i.e. 2gram (Sent)={ C1,C1C2,…,Ck-1CK,CK$};
It is pointed out that the importance of each element is differed in 2gram (Sent):Ci-1CiWhen being a word in Chinese, Ci-1Ci It is bigger in the effect of 2gram (Sent);In order to reflect the importance of each element in 2gram (Sent), to previously defined Interset(S1,S2) be improved, a new radix is introduced, it is called weighting common factor radix WInterset (S1,S2);Its meter Calculation method is as follows:To given two set S1And S2
(1)WInterset(S1,S2)=0;
(2) to Interset (S1,S2) each element e, if e is a word in Chinese, WInterset (S1,S2)= WInterset(S1,S2)+1.2, i.e. WInterset (S1,S2) cumulative 1.2, rather than 1;Otherwise WInterset (S1,S2)= WInterset(S1,S2)+1, i.e. WInterset (S1,S2) cumulative 1;
The computational methods of wsim (NewTerm, Term) are as follows:
(1) if NewTerm and Term has identical prefix and suffix, wsim (NewTerm, Term)=1;
(2) if NewTerm and Term does not have identical prefix and suffix, w s i m ( N e w T e r m , T e r m ) = W I n t e r s e t ( 2 g r a m ( N e w T e r m ) , 2 g r a m ( T e r m ) ) | U n i o n ( 2 g r a m ( N e w T e r m ) 2 g r a m ( T e r m ) ) | ;
Step C5: Verify NewTerm using its context in Sij. Specifically: if the part of speech of the segmented word immediately preceding NewTerm in Sij is one of c, d, p, r, u, z, and the part of speech of the segmented word immediately following NewTerm in Sij is also one of c, d, p, r, u, z, then NewTerm is a correct new term and is added to Result; otherwise it is discarded, i.e. not added to Result;
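The context check of step C5 can be sketched in Python as follows. The sketch assumes the sentence has already been segmented and tagged into (word, part-of-speech) pairs and that the position of NewTerm is known; the token format, the function name, and the sample sentence are hypothetical. The text does not say what happens when NewTerm is the first or last token, so this sketch rejects that case, which is also an assumption.

```python
from typing import List, Tuple

# The six part-of-speech tags listed in step C5.
CONTEXT_POS = frozenset({"c", "d", "p", "r", "u", "z"})


def context_confirms(tokens: List[Tuple[str, str]], index: int) -> bool:
    """Return True when the tokens directly before and after
    tokens[index] both carry one of the tags in CONTEXT_POS."""
    # ASSUMPTION: a candidate without both neighbours is not confirmed.
    if not 0 < index < len(tokens) - 1:
        return False
    prev_pos = tokens[index - 1][1]
    next_pos = tokens[index + 1][1]
    return prev_pos in CONTEXT_POS and next_pos in CONTEXT_POS


# Hypothetical tagged sentence; the candidate is the token at index 1.
sentence = [("对", "p"), ("高血压", "n"), ("的", "u"), ("治疗", "v")]
print(context_confirms(sentence, 1))
```

In the sample, the candidate is preceded by a token tagged p and followed by a token tagged u, so the check succeeds.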
Step C6: Output Result as the final result.
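The decision flow of steps C4 to C6 can be summarised in a short Python sketch. The similarity function and the context check are passed in as parameters so the block stays self-contained; the function name, the parameter names, the toy similarity, and the threshold value 0.8 are all illustrative assumptions (the text only states that α lies in [0,1]).

```python
from typing import Callable, Iterable, List


def filter_candidates(candidates: Iterable[str],
                      seed_terms: Iterable[str],
                      similarity: Callable[[str, str], float],
                      context_ok: Callable[[str], bool],
                      alpha: float = 0.8) -> List[str]:
    """Keep a candidate if it is similar enough to some seed term
    (step C4) or, failing that, if its context confirms it (step C5)."""
    seeds = list(seed_terms)
    result: List[str] = []
    for new_term in candidates:
        if any(similarity(new_term, t) > alpha for t in seeds):
            result.append(new_term)          # step C4
        elif context_ok(new_term):
            result.append(new_term)          # step C5
        # otherwise the candidate is discarded
    return result                            # step C6


def toy_similarity(a: str, b: str) -> float:
    """Toy stand-in for wsim: share of characters common to both terms."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


print(filter_candidates(["abcd", "xyz", "qq"], ["abce"],
                        toy_similarity, lambda t: t == "xyz", alpha=0.5))
```

The context check is consulted only for candidates that fail the similarity test, which mirrors the "otherwise, step C5 is performed" wording of step C4.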
CN201510845390.7A 2015-11-27 2015-11-27 New term recognition method Active CN106815187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510845390.7A CN106815187B (en) 2015-11-27 2015-11-27 New term recognition method

Publications (2)

Publication Number Publication Date
CN106815187A (en) 2017-06-09
CN106815187B (en) 2020-04-14

Family

ID=59101945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510845390.7A Active CN106815187B (en) 2015-11-27 2015-11-27 New term recognition method

Country Status (1)

Country Link
CN (1) CN106815187B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102693244A (en) * 2011-03-23 2012-09-26 日电(中国)有限公司 Method and device for identifying information in non-structured text
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN102955771A (en) * 2011-08-18 2013-03-06 华东师范大学 Technology and system for automatically recognizing Chinese new words in single-word-string mode and affix mode

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
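BLANCA FLORENTINO-LIANO et al.: "LONG TERM HUMAN ACTIVITY RECOGNITION WITH AUTOMATIC ORIENTATION ESTIMATION", 《2012 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING》 *
Zhang Rong (张榕): "Research on term definition extraction, clustering and term recognition", 《China Doctoral and Master's Dissertations Full-text Database (Doctoral)》 *
Wang Weimin (王卫民): "Research on a domain term recognition method based on seed expansion", 《Application Research of Computers》 *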

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084979A (en) * 2020-09-14 2020-12-15 武汉轻工大学 Food component identification method, device, equipment and storage medium
CN112084979B (en) * 2020-09-14 2023-07-11 武汉轻工大学 Food ingredient identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106815187B (en) 2020-04-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 212009 Zhenjiang high tech Industrial Development Zone, Jiangsu, No. 668, No. twelve road.

Applicant after: Zhongke national power (Zhenjiang) Intelligent Technology Co., Ltd.

Address before: 212009 18 building, North Tower, Twin Tower Rd 468, twelve road 468, Ding Mo Jing, Jiangsu.

Applicant before: Knowology Intelligent Technology Co., Ltd.

GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Fu Jianhui

Inventor after: Wang Weimin

Inventor after: Cao Yang

Inventor before: Fu Jianhui

Inventor before: Wang Weiming

Inventor before: Cao Yang
