Disclosure of Invention
The technical problems to be solved by the invention are as follows: the precision of new term identification (problem 1) and the breadth of identification (problem 2).
In response to problem 1, the present invention introduces a seed term dictionary technique that not only utilizes a seed dictionary for the recognition of new terms, but also uses it to validate newly obtained new terms.
To address problem 2, the invention introduces a multi-source iterative new term identification technique. First, a multi-source analysis method is adopted: comparative verification is carried out across multiple texts, improving the recognition precision of new terms. At the same time, the obtained new terms are added to the seed term dictionary and continuously reused, so as to obtain more new terms.
In order to achieve the purpose, the invention provides the following technical scheme:
A new term recognition method, characterized in that the method comprises the following steps:
the first step: the text word sequence module A performs word segmentation on each text in an input text library RCorpus to form a text word sequence;
the open-source ICTCLAS system is adopted to perform word segmentation on each input text D in RCorpus, and the word segmentation result is T' = W1/pos1 W2/pos2 … Wi/posi … Wn/posn, wherein each Wi is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and posi is its corresponding part of speech;
to mark the distinction, each text in RCorpus is segmented, and the resulting text library is denoted TCorpus;
the second step: the new term recognition module B performs new term recognition on each text word sequence in the segmented text library TCorpus;
the current text to be recognized is Di, Ti is its title, and Sij is the current j-th sentence of Di to be identified; Sij is processed by the following steps to form candidate new term results, which are stored in the set tmp_result:
step B1: set tmp_result to empty;
tmp_result is used for storing the identified new term results and transmitting them to the verification module C for verification; therefore, the new term results in tmp_result are also called candidate new term results, or new term results to be verified;
step B2: will SijForming a candidate new term by the continuous longest word with the part of speech marked as a, b, j, n, m and q, and marking as NewTerm; the term "continuous longest" means at SijThe two ends of the middle NewTerm have no words with the parts of speech a, b, j and n;
step B3: if in Sij the part of speech of the word W immediately following NewTerm is k, i.e., W may be a suffix of NewTerm, then set NewTerm to the splice of NewTerm and W;
step B4: if in Sij the part of speech of the word W immediately preceding NewTerm is h, i.e., W may be a prefix of NewTerm, then set NewTerm to the splice of W and NewTerm;
step B5: put (NewTerm, Ti, Sij) into tmp_result;
the third step: the verification module C verifies the identified new term;
the verification module C mainly adopts a multi-source verification method and a special verification method to verify the new terms in tmp_result generated by the new term identification module B, and the verified new terms are put into the set result; the method of the verification module C is as follows:
step C1: set result to empty;
step C2: for each triple (NewTerm, Ti, Sij) in tmp_result, circularly perform the following steps C3, C4 and C5;
step C3: if there exists in tmp_result a triple (NewTerm, Ti′, Sij′) such that Ti and Ti′ are different, i.e., NewTerm appears in two different texts in TCorpus, then NewTerm is placed in result; otherwise, step C4 is executed;
as described in step C3, although NewTerm is identified as a candidate new term in sentence Sij of the text entitled Ti, NewTerm is not necessarily a correct new term; however, if NewTerm is also identified in sentence Sij′ of the text entitled Ti′, the likelihood that NewTerm is a correct new term is greatly increased;
step C4: if a seed term Term exists in the seed dictionary such that the weighted similarity wsim(NewTerm, Term) of NewTerm and Term is greater than α, wherein α ∈ [0, 1] is a threshold, then NewTerm is put into result; otherwise, step C5 is executed;
to define the weighted similarity wsim(NewTerm, Term) of two terms, we first give the calculation method of the function 2gram; for a non-empty character string Sent = C1C2…Ci-1Ci…CK-1CK, where each Ci is a Chinese character, digit or English letter, we introduce head and tail markers to obtain the marked string $C1C2…Ci-1Ci…CK-1CK$; 2gram(Sent) is the set of pairs of consecutive characters taken from left to right in the marked string, i.e., 2gram(Sent) = {$C1, C1C2, …, CK-1CK, CK$};
It should be noted that the importance of each element of 2gram(Sent) is not the same: when Ci-1Ci is a Chinese word, Ci-1Ci plays a greater role in 2gram(Sent); to reflect the importance of each element of 2gram(Sent), the intersection Interset(S1, S2) is improved by introducing a new cardinality, called the weighted intersection cardinality, WInterset(S1, S2); the calculation method is as follows: for two given sets S1 and S2:
(1) WInterset(S1, S2) = 0;
(2) for each element e of Interset(S1, S2): if e is a Chinese word, then WInterset(S1, S2) = WInterset(S1, S2) + 1.2, i.e., WInterset(S1, S2) accumulates 1.2 instead of 1; otherwise, WInterset(S1, S2) = WInterset(S1, S2) + 1, i.e., WInterset(S1, S2) accumulates 1;
wsim(NewTerm, Term) is calculated as follows:
(1) if NewTerm has the same prefix and suffix as Term, then wsim(NewTerm, Term) = 1;
(2) if NewTerm does not have the same prefix and suffix as Term, then wsim(NewTerm, Term) is computed from the weighted intersection cardinality of 2gram(NewTerm) and 2gram(Term);
wherein the intersection, union and cardinality of sets are defined as follows: given two sets S1 and S2, their intersection is denoted Interset(S1, S2), their union is denoted Union(S1, S2), and the cardinality of S1 is denoted |S1|;
Step C5: verify NewTerm using its context in Sij; the specific method is as follows: when the part of speech of the participle immediately before NewTerm in Sij is one of c, d, p, r, u, z, and the part of speech of the participle immediately after NewTerm in Sij is one of c, d, p, r, u, z, then NewTerm is a correct new term and is added to result; otherwise it is discarded, i.e., not added to result;
step C6: output result as the final result.
Advantageous effects: the invention provides a new term identification method and system with high precision and high recall. In tests on up to 2 GB of web page corpora, covering various industries and professional fields in addition to the news field, the recognition accuracy of new terms was 93.8%. The invention therefore achieves good recognition performance, meets the goal of practical application, and lays a solid foundation for many applications such as dictionary compilation, word segmentation, text classification, public opinion analysis and advertisement analysis.
Detailed Description
In order to be able to explain the invention more clearly, the following terms are defined and explained below:
(1) Word length: the length of a word. A Chinese word is made up of one or more Chinese characters, and the length of a word equals the number of characters it contains. A word of length 1 is called a single-character word, a word of length 2 a two-character word, a word of length 3 a three-character word, and so on.
(2) Multi-character word: a Chinese word with a word length of 3 or more and a definite meaning is called a multi-character word, such as "central spirit" and "positive energy", where the former is a four-character word and the latter a three-character word.
(3) Dictionary: a word list comprising a group of words, where each word may be a single-character word (word length 1), a two-character word (word length 2) or a multi-character word.
(4) New term: given a term dictionary, a term that does not appear in the dictionary is called a new term, also known as an unregistered term.
(5) Text word sequence and sentence word sequence: segmenting a text produces a word sequence, called a text word sequence, or simply a word sequence. When the text is a single sentence, we also speak of the sentence word sequence. Where the context is clear, the sentence word sequence is likewise simply called a word sequence.
(6) ICTCLAS system: a free, open-source word segmentation system. The system takes a text as input and outputs the word segmentation sequence of the text. The download website of the ICTCLAS system is: http://ictclas.nlpir.org. After word segmentation, each segment is tagged with a part of speech, where a denotes an adjective, b a distinguishing word, c a conjunction, d an adverb, h a prefix, j an abbreviation, k a suffix, m a numeral, n a noun, p a preposition, q a quantifier, r a pronoun, u an auxiliary word, z a status word, and so on.
(7) Text: whether a short sentence or a long article, we refer to it uniformly as text. For simplicity, unless otherwise specified, all web pages mentioned in this specification refer to the plain text obtained after removing HTML tags, CSS code (Cascading Style Sheets), DIV code (the positioning technique in Cascading Style Sheets, from "division"), JS code (JavaScript) and the like.
(8) Character string splicing: given any two character strings V1 and V2, the splice of V1 and V2 is the new character string obtained by seamlessly connecting them together, denoted V1⊕V2. For example, the splice of "note" and "book" is "notebook".
(9) Intersection, union, cardinality of sets: given two sets S1 and S2, their intersection, denoted Interset(S1, S2), is the set of those elements that appear in S1 and also appear in S2. For example, if S1 = {notebook, computer} and S2 = {notebook, computer, 4G}, then Interset(S1, S2) = {notebook, computer}. The set of elements that appear in S1 or in S2 is the union of S1 and S2, denoted Union(S1, S2). For example, if S1 = {notebook, computer} and S2 = {notebook, computer, 4G}, then Union(S1, S2) = {notebook, computer, 4G}. The number of elements in a set is the cardinality of the set; the cardinality of S1 is written |S1|. For example, if S1 = {notebook, computer}, then |S1| = 2.
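The worked example above can be reproduced directly with Python's built-in set type; this is only an illustration of the three set operations, with the names S1 and S2 taken from the text:

```python
# Reproduce the worked example: S1 = {notebook, computer}, S2 = {notebook, computer, 4G}.
S1 = {"notebook", "computer"}
S2 = {"notebook", "computer", "4G"}

interset = S1 & S2   # Interset(S1, S2): elements appearing in both sets
union = S1 | S2      # Union(S1, S2): elements appearing in either set
card = len(S1)       # |S1|: the cardinality of S1
```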
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention provides a new term recognition method which takes a text set RCorpus as input (called the input text library) and works through the following three modules:
Module A: segment each text in the input text library RCorpus to form text word sequences.
Module B: perform new term recognition on each text word sequence in the segmented text library TCorpus.
Module C: verify the identified new terms.
Module A: segment each text in the input text library RCorpus to form text word sequences.
The open-source ICTCLAS system is adopted to perform word segmentation on each input text D in RCorpus, and the word segmentation result is T' = W1/pos1 W2/pos2 … Wi/posi … Wn/posn, wherein each Wi is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and posi is its corresponding part of speech.
In word segmentation, part-of-speech tagging is common practice in computing. Common parts of speech include n (noun), v (verb), a (adjective), d (adverb), p (preposition), and so on. For example, the word segmentation result of the sentence "headquarters organization cadres learn central spirit" is: "headquarters/n organization/n cadres/n learn/v central/n spirit/n".
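As an illustration only (the real ICTCLAS toolkit has its own API; the tuple representation below is an assumption made for this sketch), a segmented sentence can be held in memory as (word, part-of-speech) pairs mirroring the Wi/posi notation:

```python
# Hypothetical in-memory form of the example segmentation
# "headquarters/n organization/n cadres/n learn/v central/n spirit/n".
tokens = [
    ("headquarters", "n"),
    ("organization", "n"),
    ("cadres", "n"),
    ("learn", "v"),
    ("central", "n"),
    ("spirit", "n"),
]

# Parts of speech can then be filtered directly, e.g. collecting all nouns:
nouns = [word for word, pos in tokens if pos == "n"]
```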
To illustrate the distinction, each text in RCorpus is segmented, and the resulting text is denoted as TCorpus.
Module B: perform new term recognition on each text word sequence in the segmented text library TCorpus.
The specific implementation of module B is as follows:
Suppose the current text to be recognized is Di, Ti is its title, and Sij is the current j-th sentence of Di to be identified. Sij is processed by the following steps to form candidate new term results (also called new term results to be verified), which are stored in the set tmp_result:
Step B1: tmp_result is set to empty.
tmp_result is used for storing the identified new term results and transmitting them to the verification module C for verification. Thus, the new term results in tmp_result are also referred to as candidate new term results (also called new term results to be verified).
Step B2: will SijThe continuous longest word in the list with part of speech marked as a, b, j, n, m and q forms a candidate new term, which is marked as NewTerm.
Described in step B2 "Continuous longest "means at SijThe two ends of middle NewTerm have no words with the parts of speech a, b, j and n.
Step B3: if in Sij the part of speech of the word W immediately following NewTerm is k (i.e., W may be a suffix of NewTerm), then set NewTerm to the splice of NewTerm and W.
Step B4: if in Sij the part of speech of the word W immediately preceding NewTerm is h (i.e., W may be a prefix of NewTerm), then set NewTerm to the splice of W and NewTerm.
Step B5: will (NewTerm, T)i,Sij) Put into tmp _ result.
Module C: verify the identified new terms.
This module mainly adopts a multi-source verification method and a special verification method to verify the terms in tmp_result generated by module B, and the verified new terms are put into the set result. The method of module C is as follows:
Step C1: result is set to empty.
Step C2: for each triple (NewTerm, Ti, Sij) in tmp_result, loop through the following steps C3, C4 and C5.
Step C3: if there exists in tmp_result a triple (NewTerm, Ti′, Sij′) such that Ti and Ti′ are different (i.e., NewTerm appears in two different texts in TCorpus), then NewTerm is placed in result; otherwise, step C4 is performed.
As described in step C3, although NewTerm is identified as a candidate new term in sentence Sij of the text entitled Ti, NewTerm is not necessarily a correct new term. However, if NewTerm is also identified in sentence Sij′ of the text entitled Ti′, the likelihood that NewTerm is a correct new term is greatly increased. (Of course, if a third different text were used to further verify NewTerm, NewTerm would be even more likely to be correct, but this would reduce the breadth of new term recognition; experiments have shown that verification against a third different text is not required.)
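The multi-source check of step C3 amounts to grouping candidates by term and asking whether a term was seen under at least two different titles; a minimal sketch:

```python
def multi_source_verified(tmp_result):
    """Return the set of NewTerms that occur under two or more different
    titles Ti, i.e. that appear in two different texts of TCorpus."""
    titles_seen = {}
    for term, title, _sentence in tmp_result:
        titles_seen.setdefault(term, set()).add(title)   # record titles per term
    return {term for term, titles in titles_seen.items() if len(titles) >= 2}
```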
Step C4: if a seed term Term exists in the seed dictionary such that the weighted similarity wsim(NewTerm, Term) > α (where α ∈ [0, 1] is a threshold), then put NewTerm into result; otherwise, execute step C5.
To define the weighted similarity wsim(NewTerm, Term) of two terms, we first give the calculation method of the function 2gram. For a non-empty character string Sent = C1C2…Ci-1Ci…CK-1CK, where each Ci is a Chinese character, digit or English letter, we introduce head and tail markers to obtain the marked string $C1C2…Ci-1Ci…CK-1CK$. 2gram(Sent) is the set of pairs of consecutive characters taken from left to right in the marked string, i.e., 2gram(Sent) = {$C1, C1C2, …, CK-1CK, CK$}.
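The 2gram function is straightforward to implement; the sketch below uses "$" as both the head and the tail marker, as in the definition above:

```python
def two_gram(sent):
    """Return the set of consecutive character pairs of sent after
    adding the head marker '$' and the tail marker '$'."""
    marked = "$" + sent + "$"
    return {marked[i:i + 2] for i in range(len(marked) - 1)}
```

For instance, two_gram("abc") yields the set {"$a", "ab", "bc", "c$"}.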
It should be noted that the importance of each element of 2gram(Sent) is not the same: when Ci-1Ci is a Chinese word, Ci-1Ci plays a greater role in 2gram(Sent). To reflect the importance of each element of 2gram(Sent), the present invention improves on the intersection Interset(S1, S2) defined above by introducing a new cardinality, called the weighted intersection cardinality, WInterset(S1, S2). The calculation method is as follows: for two given sets S1 and S2:
(1) WInterset(S1, S2) = 0;
(2) for each element e of Interset(S1, S2): if e is a Chinese word, then WInterset(S1, S2) = WInterset(S1, S2) + 1.2, i.e., WInterset(S1, S2) accumulates 1.2 instead of 1; otherwise, WInterset(S1, S2) = WInterset(S1, S2) + 1, i.e., WInterset(S1, S2) accumulates 1;
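The weighted intersection cardinality can be sketched as below. The test "e is a Chinese word" is delegated to a caller-supplied lexicon set, since the source does not specify how that membership is decided:

```python
def winterset(s1, s2, lexicon):
    """WInterset(S1, S2): shared elements that are dictionary words
    contribute 1.2 each; all other shared elements contribute 1."""
    total = 0.0
    for e in s1 & s2:                        # iterate over Interset(S1, S2)
        total += 1.2 if e in lexicon else 1.0
    return total
```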
wsim(NewTerm, Term) is calculated as follows:
(1) if NewTerm has the same prefix and suffix as Term, then wsim(NewTerm, Term) = 1;
(2) if NewTerm does not have the same prefix and suffix as Term, then wsim(NewTerm, Term) is computed from the weighted intersection cardinality of 2gram(NewTerm) and 2gram(Term).
for the first case of the above formula, we give an example for explanation. Let NewTerm be "the eighteenth fourth meeting in total", and Term be "the lothane meeting in total". At this time wsim ("eighteenth conference in total, fourth conference in total, or" conference in total ") is 1 because Term and NewTerm have a common prefix and suffix.
Step C5: verify NewTerm using its context in Sij. The specific method is as follows: when the part of speech of the participle immediately before NewTerm in Sij is one of c, d, p, r, u, z, and the part of speech of the participle immediately after NewTerm in Sij is one of c, d, p, r, u, z, then NewTerm is a correct new term and is added to result; otherwise it is discarded, i.e., not added to result.
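Step C5 reduces to inspecting the part-of-speech tags of the tokens on either side of NewTerm. A sketch, assuming the sentence is available as (word, pos) pairs and that NewTerm occupies tokens[start:end] (the slice convention is an assumption of this illustration):

```python
CONTEXT_POS = {"c", "d", "p", "r", "u", "z"}  # function-word tags accepted as context

def context_verified(tokens, start, end):
    """Return True when the token just before tokens[start] and the token
    at tokens[end] both carry one of the accepted context POS tags."""
    if start == 0 or end >= len(tokens):      # no word on one side: reject
        return False
    return tokens[start - 1][1] in CONTEXT_POS and tokens[end][1] in CONTEXT_POS
```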
Step C6: output result as the final result.
Experimental effects
When calculating the similarity wsim(NewTerm, Term) between new terms and seed terms in the dictionary, repeated experiments show that the results are best when α = 0.6; the recognition precision for multi-character new terms reaches 93.8%.