CN106815187A

CN106815187A - An efficient new-term recognition system and method

Info

Publication number: CN106815187A
Application number: CN201510845390.7A
Authority: CN
Inventors: 符建辉; 王卫明; 曹阳
Original assignee: KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Current assignee: KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date: 2015-11-27
Filing date: 2015-11-27
Publication date: 2017-06-09
Anticipated expiration: 2035-11-27
Also published as: CN106815187B

Abstract

The present invention relates to an efficient new-term recognition system and method. The system comprises: a text word sequence module A, which segments each document in the input text library RCorpus to form a text word sequence; a new-term recognition module B, which performs new-term recognition on the word sequence of each document in the segmented text library TCorpus; and a verification module C, which verifies the recognized new terms. The method comprises the following steps. Step 1: module A segments each text in the input text library RCorpus to form a text word sequence. Step 2: module B performs new-term recognition on each text word sequence in the segmented text library TCorpus. Step 3: module C verifies the recognized new terms. The invention provides a new-term recognition method and system with high precision and high recall. The recognition precision for new terms is 93.8%.

Description

An efficient new-term recognition system and method

Technical field

The present invention relates to Chinese natural language processing and the automatic recognition of new Chinese words, and more particularly to a system and method for automatically recognizing new terms.

Background art

With the rapid development of the Internet, new terms of all kinds emerge constantly, which creates great difficulty for natural language processing applications, automatic application software (such as word segmentation systems), and dictionary compilation.

New-term recognition has been studied for many years. Existing methods fall into three classes. The first class is statistics-based. For example, Kenneth Ward Church, Béatrice Daille, and others use mutual information (Mutual Information) to extract fixed combinations and collocations of words: they consider adjacent character combinations that frequently co-occur to be terms in general, and use mutual information to judge the degree of co-occurrence of a phrase. As another example, Ted Dunning, Jonathan D. Cohen, and others use the log-likelihood ratio (Log-Likelihood Ratio) to address the recognition of low-frequency words, and demonstrate the effectiveness of the method both theoretically and empirically. Statistical methods also include maximum matching, hidden Markov models, maximum entropy, and so on. The second class is based on linguistic features and lexical patterns. For example, Liu Lei, Wang Shi, and Tian Guogang use multiple features, combined with morphological and syntactic patterns, to acquire new technical terms. The third class combines the first two and thereby overcomes their respective shortcomings.

However, detailed experimental analysis shows that the above methods have the following two problems.

Problem 1: recognition precision. Purely statistical methods can recognize more new terms, but they usually introduce a large number of errors; that is, Chinese character strings that are not new terms are mistaken for new terms. For example, in the sentence "general headquarters organizes cadres to study the spirit of the Central Committee", a statistical method easily misidentifies "general headquarters organizes cadres", "organizes cadres", "cadres study", and so on as new terms, which they clearly are not. Conversely, ensuring very high recognition precision restricts recognition coverage. This is one of the key problems the present invention needs to solve.

Problem 2: recognition coverage. Because words can combine in many ways, automatic new-term recognition easily misses meaningful new terms. How to improve recognition coverage is therefore an important problem, and it is the other key problem the present invention needs to solve.

Summary of the invention

The technical problems to be solved by the invention are the recognition precision and the recognition coverage of new terms.

For problem 1, the invention introduces a seed term dictionary technique: the seed dictionary is used not only to recognize new terms but also to verify newly acquired ones.

For problem 2, the invention introduces a multi-source, iterative new-term recognition technique. First, multi-source analysis uses evidence from multiple texts to verify candidates and thereby raise recognition precision; at the same time, newly acquired terms are added to the seed term dictionary and the process is repeated, so that more new terms are obtained.

To achieve the above objects, the invention provides the following technical solution. An efficient new-term recognition system, characterized in that it comprises: a text word sequence module A, which segments each document in the input text library RCorpus to form a text word sequence; a new-term recognition module B, which performs new-term recognition on the word sequence of each document in the segmented text library TCorpus; and a verification module C, which verifies the recognized new terms;

Among these modules, module A segments each document in the input text library RCorpus into a segmented text word sequence, thereby forming the segmented text library TCorpus for use by new-term recognition module B. Module B performs new-term recognition on each document in TCorpus and produces a set of new-term results to be verified, for use by verification module C. Module C further verifies the new terms recognized by module B.

An efficient new-term recognition method, characterized in that it comprises the following steps:

Step 1: text word sequence module A segments each text in the input text library RCorpus to form a text word sequence;

Each input text D in RCorpus is segmented with the open-source ICTCLAS system. The segmentation result is T' = W₁/pos₁ W₂/pos₂ … W_i/pos_i … W_n/pos_n, where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech;

To distinguish the two, the collection of texts produced by segmenting every text in RCorpus is denoted TCorpus;

Step 2: new-term recognition module B performs new-term recognition on each text word sequence in the segmented text library TCorpus;

Let D_i be the text currently being processed, T_i its title, and S_ij the j-th sentence of D_i, currently being processed. S_ij is processed by the following steps to form candidate new-term results, which are stored in the set tmp_result:

Step B1: set tmp_result to empty;

tmp_result stores the recognized new-term results and is passed to verification module C for verification. The results in tmp_result are therefore also called candidate new-term results, or new-term results to be verified;

Step B2: form a candidate new term, denoted NewTerm, from the longest contiguous run of words in S_ij whose parts of speech are tagged a, b, j, n, m, or q; "longest contiguous" means that the words adjacent to the two ends of NewTerm in S_ij do not have part of speech a, b, j, or n;

Step B3: if the word W immediately following NewTerm in S_ij has part of speech k (that is, W may be a suffix of NewTerm), set NewTerm = NewTerm ⊕ W, where ⊕ denotes string concatenation;

Step B4: if the word W immediately preceding NewTerm in S_ij has part of speech h (that is, W may be a prefix of NewTerm), set NewTerm = W ⊕ NewTerm;

Step B5: put (NewTerm, T_i, S_ij) into tmp_result;

Step 3: verification module C verifies the recognized new terms;

The main work of verification module C is to verify the new terms in the tmp_result produced by new-term recognition module B, using a multi-source verification method and a special verification method; verified new terms are put into the set result. Module C works as follows:

Step C1: set result to empty;

Step C2: for each triple (NewTerm, T_i, S_ij) in tmp_result, perform steps C3, C4, and C5 below;

Step C3: if tmp_result contains (NewTerm, T_i', S_ij') with T_i' different from T_i (that is, NewTerm appears in two different texts in TCorpus), put NewTerm into result; otherwise, perform step C4;

Regarding step C3: although NewTerm was recognized as a candidate new term in sentence S_ij of the text titled T_i, NewTerm is not necessarily a correct new term. If, however, it is also recognized as a new term in sentence S_ij' of the text titled T_i', the likelihood that NewTerm is a correct new term rises considerably;

Step C4: if the seed dictionary contains a seed term Term such that the weighted similarity wsim(NewTerm, Term) > α, where α ∈ [0, 1] is a threshold, put NewTerm into result; otherwise, perform step C5;

To define the weighted similarity wsim(NewTerm, Term) of two terms, the function 2gram is defined first. For a non-empty character string Sent = C₁C₂…C_{i-1}C_i…C_{K-1}C_K, where each C_i is a Chinese character, digit, or English letter, a string with head and tail markers is introduced: Sent = $C₁C₂…C_{i-1}C_i…C_{K-1}C_K$. 2gram(Sent) is the set of all pairs of consecutive characters in the marked Sent, taken from left to right, i.e. 2gram(Sent) = {$C₁, C₁C₂, …, C_{K-1}C_K, C_K$};

Note that the elements of 2gram(Sent) are not equally important: when C_{i-1}C_i is a word in Chinese, it carries more weight in 2gram(Sent). To reflect the importance of each element, the intersection Interset(S₁, S₂) is refined by introducing a new cardinality, called the weighted intersection cardinality WInterset(S₁, S₂). It is computed as follows for two given sets S₁ and S₂:

(1) WInterset(S₁, S₂) = 0;

(2) for each element e of Interset(S₁, S₂): if e is a word in Chinese, WInterset(S₁, S₂) = WInterset(S₁, S₂) + 1.2, i.e. WInterset(S₁, S₂) is incremented by 1.2 rather than 1; otherwise WInterset(S₁, S₂) = WInterset(S₁, S₂) + 1, i.e. it is incremented by 1;

wsim(NewTerm, Term) is computed as follows:

(1) if NewTerm and Term have the same prefix and suffix, wsim(NewTerm, Term) = 1;

(2) if NewTerm and Term do not have the same prefix and suffix,

wsim(NewTerm, Term) = \frac{WInterset(2gram(NewTerm), 2gram(Term))}{|Union(2gram(NewTerm), 2gram(Term))|};

Step C5: verify NewTerm using its context in S_ij. Specifically, when the part of speech of the word preceding NewTerm in S_ij is one of c, d, p, r, u, z, and the part of speech of the word following NewTerm in S_ij is also one of c, d, p, r, u, z, NewTerm is a correct new term and is added to result; otherwise it is discarded, that is, not added to result;

Step C6: output result as the final result.

Beneficial effects: the invention provides a new-term recognition method and system with high precision and high recall. It was tested on a web-page corpus of up to 2 GB that covers, besides news, a variety of industries and professional domains. The recognition precision for new terms is 93.8%. The invention thus achieves good recognition performance, is fit for practical application, and lays a solid foundation for many applications such as dictionary compilation, word segmentation, text classification, public-opinion analysis, and advertisement analysis.

Brief description of the drawings

Fig. 1 is a workflow diagram of the new-term recognition system of the invention;

Fig. 2 is a schematic diagram of the working method of module B;

Fig. 3 is a schematic diagram of the working method of module C.

Detailed description of the embodiments

To describe the invention more clearly, the following terms are defined and explained:

(1) Word length: the length of a word. A Chinese word consists of one or more Chinese characters, and its length equals the number of characters it contains. A word of length 1 is called a one-character word, a word of length 2 a two-character word, a word of length 3 a three-character word, and so on.

(2) Multi-character word: a word with a definite meaning that consists of three or more Chinese characters, for example "CPC spirit" and "positive energy"; in Chinese the former is a four-character word and the latter a three-character word.

(3) Dictionary: a list consisting of a group of words, each of which may be a one-character word (length 1), a two-character word (length 2), or a multi-character word.

(4) New term: given a term dictionary, a term that does not occur in that dictionary is called a new term, also called an unlisted term.

(5) Text word sequence, sentence word sequence: segmenting a text produces a sequence of words, called a text word sequence, or word sequence for short. When the text is a single sentence, it is sometimes called a sentence word sequence. Where the context is clear, a sentence word sequence is also simply called a word sequence.

(6) ICTCLAS system: a free, open-source word segmentation system. It takes a text as input and outputs the segmented sequence of that text. The ICTCLAS download address is http://ictclas.nlpir.org. After segmentation each word is tagged with its part of speech, where a denotes adjective, b distinguishing word, c conjunction, d adverb, h prefix, j abbreviation, k suffix, m numeral, n noun, p preposition, q measure word, r pronoun, u auxiliary word, z descriptive word, and so on.

(7) Text: anything from a short sentence to a long article is called a text. For simplicity, unless otherwise stated, every web page mentioned in this description refers to the plain text obtained after removing HTML tags, CSS code (Cascading Style Sheets), DIV code (Division, the positioning technique used with CSS), JS code (JavaScript), and the like.

(8) String concatenation: given any two character strings V₁ and V₂, their concatenation is the new string obtained by joining them seamlessly, denoted V₁ ⊕ V₂. For example, if V₁ = "general" and V₂ = "analysis", then V₁ ⊕ V₂ is those two strings joined end to end.

(9) Intersection, union, and cardinality of sets: given two sets S₁ and S₂, their intersection, denoted Interset(S₁, S₂), is the set of elements that appear in both S₁ and S₂. For example, if S₁ = {notebook, computer} and S₂ = {notebook, computer, 4G}, then Interset(S₁, S₂) = {notebook, computer}. The set of elements that appear in S₁ or in S₂ is the union of S₁ and S₂, denoted Union(S₁, S₂). For example, with the same S₁ and S₂, Union(S₁, S₂) = {notebook, computer, 4G}. The number of elements in a set is its cardinality; the cardinality of S₁ is denoted |S₁|, for example |S₁| = 2 for S₁ = {notebook, computer}.
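As a minimal illustration of definition (9), the three operations map directly onto Python's built-in set type. The function names `interset` and `union` simply mirror the notation used here and are not part of any library.

```python
def interset(s1: set, s2: set) -> set:
    """Interset(S1, S2): elements that appear in both sets."""
    return s1 & s2


def union(s1: set, s2: set) -> set:
    """Union(S1, S2): elements that appear in at least one of the sets."""
    return s1 | s2


s1 = {"notebook", "computer"}
s2 = {"notebook", "computer", "4G"}

common = interset(s1, s2)   # {"notebook", "computer"}
either = union(s1, s2)      # {"notebook", "computer", "4G"}
size = len(s1)              # cardinality |S1| = 2
```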

The invention is described in more detail below with reference to the accompanying drawings and specific embodiments.

The efficient new-term recognition system and method proposed by the invention takes a text set RCorpus (called the input text library) as input and operates with the following three modules:

Module A: segments each text in the input text library RCorpus to form a text word sequence.

Module B: performs new-term recognition on each text word sequence in the segmented text library TCorpus.

Module C: verifies the recognized new terms.

Each input text D in RCorpus is segmented with the open-source ICTCLAS system. The segmentation result is T' = W₁/pos₁ W₂/pos₂ … W_i/pos_i … W_n/pos_n, where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech.

Part-of-speech tagging during segmentation is common practice in computing. Common parts of speech include n (noun), v (verb), a (adjective), d (adverb), p (preposition), and so on. For example, the segmentation result of the sentence "general headquarters organizes cadres to study the spirit of the Central Committee" is: "headquarters/n organization/n cadres/n study/v Central Committee/n spirit/n".
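The word/pos notation above can be turned into a list of (word, part-of-speech) pairs with a few lines of Python. This is only a sketch: it assumes the segmenter output is a single line of whitespace-separated `word/pos` tokens, which may differ from the actual ICTCLAS output format, and the function name `parse_tagged` is hypothetical. English glosses stand in for the Chinese words.

```python
def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split 'word/pos word/pos ...' into (word, pos) pairs.

    Assumption: tokens are separated by whitespace and the tag follows
    the last '/' in each token.
    """
    pairs = []
    for token in line.split():
        word, slash, pos = token.rpartition("/")
        if not slash:
            raise ValueError(f"token without a part-of-speech tag: {token!r}")
        pairs.append((word, pos))
    return pairs


tagged = parse_tagged("headquarters/n organization/n cadres/n study/v center/n spirit/n")
# tagged[3] is ("study", "v"); every other entry carries the tag "n"
```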

To distinguish the two, the collection of texts produced by segmenting every text in RCorpus is denoted TCorpus.

Module B is implemented as follows:

Assume that D_i is the text currently being processed, T_i is its title, and S_ij is the j-th sentence of D_i, currently being processed. S_ij is processed by the following steps to form candidate new-term results (also called new-term results to be verified), which are stored in the set tmp_result:

Step B1: set tmp_result to empty.

tmp_result stores the recognized new-term results and is passed to verification module C for verification. The results in tmp_result are therefore also called candidate new-term results (or new-term results to be verified).

Step B2: form a candidate new term, denoted NewTerm, from the longest contiguous run of words in S_ij whose parts of speech are tagged a, b, j, n, m, or q.

"Longest contiguous" in step B2 means that the words adjacent to the two ends of NewTerm in S_ij do not have part of speech a, b, j, or n.

Step B3: if the word W immediately following NewTerm in S_ij has part of speech k (that is, W may be a suffix of NewTerm), set NewTerm = NewTerm ⊕ W, where ⊕ denotes string concatenation.

Step B4: if the word W immediately preceding NewTerm in S_ij has part of speech h (that is, W may be a prefix of NewTerm), set NewTerm = W ⊕ NewTerm.

Step B5: put (NewTerm, T_i, S_ij) into tmp_result.
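Steps B1 to B5 can be sketched in Python as follows. This is an illustrative reading rather than the reference implementation: the function name `extract_candidates` and the `sep` parameter are hypothetical, the sketch collects every maximal run in the sentence (the steps above describe a single NewTerm), and single-word runs are kept as candidates because the text does not exclude them.

```python
CANDIDATE_POS = {"a", "b", "j", "n", "m", "q"}   # tags listed in step B2


def extract_candidates(tagged, title, sep=""):
    """Return (NewTerm, title, sentence) triples for one tagged sentence.

    tagged: list of (word, pos) pairs; sep: joiner used for concatenation
    ("" suits Chinese text, where words are written without spaces).
    """
    tmp_result = []                                      # step B1
    sentence = sep.join(word for word, _ in tagged)
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] not in CANDIDATE_POS:
            i += 1
            continue
        start = i
        while i < n and tagged[i][1] in CANDIDATE_POS:   # step B2: maximal run
            i += 1
        words = [word for word, _ in tagged[start:i]]
        if i < n and tagged[i][1] == "k":                # step B3: append suffix
            words.append(tagged[i][0])
            i += 1
        if start > 0 and tagged[start - 1][1] == "h":    # step B4: prepend prefix
            words.insert(0, tagged[start - 1][0])
        tmp_result.append((sep.join(words), title, sentence))   # step B5
    return tmp_result


demo = [("X", "h"), ("A", "n"), ("B", "n"), ("C", "k"), ("D", "v"), ("E", "a")]
candidates = extract_candidates(demo, "doc-title")
# candidates holds two triples whose terms are "XABC" and "E"
```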

Module C: verifies the recognized new terms.

The main work of this module is to verify the terms in the tmp_result produced by module B, using a multi-source verification method and a special verification method; verified new terms are put into the set result. Module C works as follows:

Step C1: set result to empty.

Step C2: for each triple (NewTerm, T_i, S_ij) in tmp_result, perform steps C3, C4, and C5 below.

Step C3: if tmp_result contains (NewTerm, T_i', S_ij') with T_i' different from T_i (that is, NewTerm appears in two different texts in TCorpus), put NewTerm into result; otherwise, perform step C4.

Regarding step C3: although NewTerm was recognized as a candidate new term in sentence S_ij of the text titled T_i, NewTerm is not necessarily a correct new term. If, however, it is also recognized as a new term in sentence S_ij' of the text titled T_i', the likelihood that NewTerm is a correct new term rises considerably. (Of course, if a third distinct text also corroborated NewTerm, the likelihood that it is a correct new term would be higher still, but this would reduce recognition coverage; experiments show that verification by a third distinct text is unnecessary.)
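The multi-source check of step C3 amounts to asking whether a candidate was found under at least two distinct titles. A minimal sketch, assuming tmp_result is a list of (NewTerm, title, sentence) triples and that titles identify texts uniquely; the function name `multi_source_verified` is hypothetical.

```python
from collections import defaultdict


def multi_source_verified(tmp_result):
    """Return the candidate terms that occur under two or more distinct titles."""
    titles_by_term = defaultdict(set)
    for new_term, title, _sentence in tmp_result:
        titles_by_term[new_term].add(title)
    return {term for term, titles in titles_by_term.items() if len(titles) >= 2}


tmp_result = [
    ("termA", "title 1", "sentence 1"),
    ("termA", "title 2", "sentence 2"),   # second, different text: passes C3
    ("termB", "title 1", "sentence 3"),
    ("termB", "title 1", "sentence 4"),   # same text twice: does not pass C3
]
passed = multi_source_verified(tmp_result)   # {"termA"}
```

Two texts are enough here, in line with the remark above that a third corroborating text is not required.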

Step C4: if the seed dictionary contains a seed term Term such that the weighted similarity wsim(NewTerm, Term) > α (α ∈ [0, 1] is a threshold), put NewTerm into result; otherwise, perform step C5.

To define the weighted similarity wsim(NewTerm, Term) of two terms, the function 2gram is defined first. For a non-empty character string Sent = C₁C₂…C_{i-1}C_i…C_{K-1}C_K, where each C_i is a Chinese character, digit, or English letter, a string with head and tail markers is introduced: Sent = $C₁C₂…C_{i-1}C_i…C_{K-1}C_K$. 2gram(Sent) is the set of all pairs of consecutive characters in the marked Sent, taken from left to right, i.e. 2gram(Sent) = {$C₁, C₁C₂, …, C_{K-1}C_K, C_K$}.
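The 2gram function can be written in a few lines. This sketch assumes the marker character "$" does not occur in the input string; the Python name `two_gram` is chosen only because an identifier cannot start with a digit.

```python
def two_gram(sent: str, mark: str = "$") -> set[str]:
    """Set of consecutive character pairs of sent after adding head and tail markers."""
    if not sent:
        raise ValueError("2gram is defined for non-empty strings only")
    padded = mark + sent + mark
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


grams = two_gram("abc")   # {"$a", "ab", "bc", "c$"}
```

Because the result is a set, a pair that occurs several times in the string is counted once.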

Note that the elements of 2gram(Sent) are not equally important: when C_{i-1}C_i is a word in Chinese, it carries more weight in 2gram(Sent). To reflect the importance of each element in 2gram(Sent), the invention refines the Interset(S₁, S₂) defined above by introducing a new cardinality, called the weighted intersection cardinality WInterset(S₁, S₂). It is computed as follows for two given sets S₁ and S₂:

(1) WInterset(S₁, S₂) = 0;

(2) for each element e of Interset(S₁, S₂): if e is a word in Chinese, WInterset(S₁, S₂) = WInterset(S₁, S₂) + 1.2 (i.e. WInterset(S₁, S₂) is incremented by 1.2 rather than 1); otherwise, if e is not a word in Chinese, WInterset(S₁, S₂) = WInterset(S₁, S₂) + 1 (i.e. WInterset(S₁, S₂) is incremented by 1);
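A sketch of the weighted intersection cardinality follows. The text does not say how "e is a word in Chinese" is decided; this sketch assumes a lookup in a word list passed in as `lexicon`, which is a hypothetical parameter, as is the function name `w_interset`.

```python
def w_interset(s1: set, s2: set, lexicon: set, word_weight: float = 1.2) -> float:
    """Weighted size of the intersection of s1 and s2.

    Elements found in lexicon (assumed to be the set of known Chinese words)
    count 1.2; all other shared elements count 1.
    """
    total = 0.0                                      # rule (1)
    for element in s1 & s2:                          # rule (2)
        total += word_weight if element in lexicon else 1.0
    return total


shared = w_interset({"$a", "ab", "bc"}, {"ab", "bc", "c$"}, lexicon={"ab"})
# the shared elements are "ab" (a word, 1.2) and "bc" (not a word, 1.0)
```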

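wsim(NewTerm, Term) is computed as follows:

(1) if NewTerm and Term have the same prefix and suffix, wsim(NewTerm, Term) = 1;

(2) if NewTerm and Term do not have the same prefix and suffix,

wsim(NewTerm, Term) = \frac{WInterset(2gram(NewTerm), 2gram(Term))}{|Union(2gram(NewTerm), 2gram(Term))|}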

An example illustrates the first case of the formula above. Let NewTerm = "CPC 18th National Congress Fourth Meeting" and Term = "CPC Luochuan Meeting". Then wsim(NewTerm, Term) = 1, because Term and NewTerm share a common prefix ("CPC") and a common suffix ("Meeting").
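The two cases of wsim can be sketched together. The text does not define how the prefix and suffix of a term are delimited; purely for illustration, this sketch compares the first and last `affix_len` characters, with a default of two characters, and that choice is an assumption. The helpers repeat the 2gram and WInterset definitions so that the block stands alone, and `lexicon` is again a hypothetical word list.

```python
def two_gram(sent: str, mark: str = "$") -> set[str]:
    padded = mark + sent + mark
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def w_interset(s1: set, s2: set, lexicon: set) -> float:
    return sum(1.2 if e in lexicon else 1.0 for e in s1 & s2)


def wsim(new_term: str, term: str, lexicon: set, affix_len: int = 2) -> float:
    """Weighted similarity of two non-empty terms.

    Case (1): same prefix and same suffix -> 1.
    Case (2): WInterset(2gram(a), 2gram(b)) / |Union(2gram(a), 2gram(b))|.
    """
    same_prefix = new_term[:affix_len] == term[:affix_len]
    same_suffix = new_term[-affix_len:] == term[-affix_len:]
    if same_prefix and same_suffix:
        return 1.0
    g1, g2 = two_gram(new_term), two_gram(term)
    return w_interset(g1, g2, lexicon) / len(g1 | g2)


case1 = wsim("abXYcd", "abZcd", lexicon=set())   # shared "ab" and "cd": 1.0
case2 = wsim("abc", "abd", lexicon={"ab"})       # (1.0 + 1.2) / 6
```

In step C4 the value would then be compared with the threshold α for each seed term.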

Step C5: verify NewTerm using its context in S_ij. Specifically, when the part of speech of the word preceding NewTerm in S_ij is one of c, d, p, r, u, z, and the part of speech of the word following NewTerm in S_ij is also one of c, d, p, r, u, z, NewTerm is a correct new term and is added to result; otherwise it is discarded, that is, not added to result.
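The context check of step C5 can be sketched as below. The sketch assumes the candidate is known by its position in the tagged sentence, as the slice tagged[start:end], and treats a candidate at the very beginning or end of the sentence as failing the check, a case the text does not address. The function name `context_verified` is hypothetical.

```python
CONTEXT_POS = {"c", "d", "p", "r", "u", "z"}   # tags listed in step C5


def context_verified(tagged, start: int, end: int) -> bool:
    """True if the words just before and just after tagged[start:end]
    both carry a part of speech from CONTEXT_POS."""
    if start <= 0 or end >= len(tagged):
        return False   # assumption: no neighbour on one side means no verification
    return tagged[start - 1][1] in CONTEXT_POS and tagged[end][1] in CONTEXT_POS


sentence = [("P", "p"), ("A", "n"), ("B", "n"), ("U", "u"), ("V", "v")]
ok = context_verified(sentence, 1, 3)       # neighbours are tagged p and u
not_ok = context_verified(sentence, 3, 4)   # neighbours are tagged n and v
```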

Step C6: output result as the final result.
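Taken together, steps C1 to C6 form a cascade in which each candidate is tried against the multi-source check, then the seed-dictionary similarity, then the context check. The sketch below shows only that control flow. The similarity and context tests are passed in as callables with hypothetical signatures (`wsim(term, seed)` and `context_ok(term, sentence)`), and the default α = 0.6 is the value reported in the experiments.

```python
from collections import defaultdict


def verify(tmp_result, seed_terms, wsim, context_ok, alpha=0.6):
    """Run the C1-C6 cascade over (NewTerm, title, sentence) triples."""
    titles_by_term = defaultdict(set)
    for term, title, _ in tmp_result:
        titles_by_term[term].add(title)

    result = set()                                        # step C1
    for term, _title, sentence in tmp_result:             # step C2
        if len(titles_by_term[term]) >= 2:                # step C3: two different texts
            result.add(term)
        elif any(wsim(term, seed) > alpha for seed in seed_terms):   # step C4
            result.add(term)
        elif context_ok(term, sentence):                  # step C5
            result.add(term)
    return result                                         # step C6


demo = [("X", "t1", "s1"), ("X", "t2", "s2"), ("Y", "t1", "s3"),
        ("Z", "t1", "s4"), ("Q", "t1", "s5")]
verified = verify(
    demo,
    seed_terms=["seed"],
    wsim=lambda term, seed: 0.9 if term == "Y" else 0.1,   # stand-in similarity
    context_ok=lambda term, sentence: term == "Z",         # stand-in context test
)
```

Here "X" passes at C3, "Y" at C4, "Z" at C5, and "Q" is discarded.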

Experimental results

The invention provides an efficient method and system for recognizing multi-character new terms. It was tested on a web-page corpus of up to 2 GB that covers, besides news, a variety of industries and professional domains. When computing the similarity wsim(NewTerm, Term) between a new term and the seed terms in the dictionary, repeated experiments show that α = 0.6 gives the best results, at which point the recognition precision for multi-character new terms is 93.8%. The invention thus achieves good recognition performance, is fit for practical application, and lays a solid foundation for many applications such as dictionary compilation, word segmentation, text classification, public-opinion analysis, and advertisement analysis.

Claims

1. An efficient new-term recognition system, characterized in that it comprises: a text word sequence module A, which segments each document in the input text library RCorpus to form a text word sequence; a new-term recognition module B, which performs new-term recognition on the word sequence of each document in the segmented text library TCorpus; and a verification module C, which verifies the recognized new terms;

2. An efficient new-term recognition method, characterized in that it comprises the following steps:

Step B1: set tmp_result to empty;

Step B3: if the word W immediately following NewTerm in S_ij has part of speech k, that is, W may be a suffix of NewTerm, set NewTerm = NewTerm ⊕ W;

Step B4: if the word W immediately preceding NewTerm in S_ij has part of speech h, that is, W may be a prefix of NewTerm, set NewTerm = W ⊕ NewTerm;

Step B5: put (NewTerm, T_i, S_ij) into tmp_result;

Step C1: set result to empty;

Step C4: if the seed dictionary contains a seed term Term such that the weighted similarity wsim(NewTerm, Term) > α, where α ∈ [0, 1] is a threshold, put NewTerm into result; otherwise, perform step C5;

To define the weighted similarity wsim(NewTerm, Term) of two terms, the function 2gram is defined first. For a non-empty character string Sent = C₁C₂…C_{i-1}C_i…C_{K-1}C_K, where each C_i is a Chinese character, digit, or English letter, a string with head and tail markers is introduced: Sent = $C₁C₂…C_{i-1}C_i…C_{K-1}C_K$. 2gram(Sent) is the set of all pairs of consecutive characters in the marked Sent, taken from left to right, i.e. 2gram(Sent) = {$C₁, C₁C₂, …, C_{K-1}C_K, C_K$};

Note that the elements of 2gram(Sent) are not equally important: when C_{i-1}C_i is a word in Chinese, it carries more weight in 2gram(Sent). To reflect the importance of each element, the intersection Interset(S₁, S₂) is refined by introducing a new cardinality, called the weighted intersection cardinality WInterset(S₁, S₂). It is computed as follows for two given sets S₁ and S₂:

(1) WInterset(S₁, S₂) = 0;

(2) for each element e of Interset(S₁, S₂): if e is a word in Chinese, WInterset(S₁, S₂) = WInterset(S₁, S₂) + 1.2, i.e. WInterset(S₁, S₂) is incremented by 1.2 rather than 1; otherwise WInterset(S₁, S₂) = WInterset(S₁, S₂) + 1, i.e. it is incremented by 1;

wsim(NewTerm, Term) is computed as follows:

(2) if NewTerm and Term do not have the same prefix and suffix,

wsim(NewTerm, Term) = \frac{WInterset(2gram(NewTerm), 2gram(Term))}{|Union(2gram(NewTerm), 2gram(Term))|};

Step C6: output result as the final result.