Disclosure of Invention
The technical problems to be solved by the invention are as follows: the precision of new term identification (problem 1) and the breadth of identification (problem 2).
In response to problem 1, the present invention introduces a seed term dictionary technique that not only utilizes a seed dictionary for the recognition of new terms, but also uses it to validate newly obtained new terms.
To address problem 2, the invention introduces a multi-source iterative new term identification technique. First, a multi-source analysis method is adopted: comparative verification is carried out across multiple texts, improving the recognition precision of new terms. At the same time, the obtained new terms are added to the seed term dictionary and continuously reused, so as to obtain more new terms.
In order to achieve the purpose, the invention provides the following technical scheme:
A new term recognition method, characterized in that the method comprises the following steps:
the first step: the text word sequence module A performs word segmentation on each text in an input text library RCorpus to form a text word sequence;
the open-source ICTCLAS system is adopted to perform word segmentation on each input text D in RCorpus, and the word segmentation result is T' = W1/pos1 W2/pos2 … Wi/posi … Wn/posn, wherein each Wi is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and posi is its corresponding part of speech;
to mark the distinction, each text in RCorpus is segmented, and the resulting text library is denoted TCorpus;
the second step: the new term recognition module B performs new term recognition on each text word sequence in the segmented text library TCorpus;
the current text to be recognized is Di, Ti is its title, and Sij is the current j-th sentence of Di to be identified; Sij is processed by the following steps to form candidate new term results, which are stored in the set tmp_result:
step B1: set tmp_result to empty;
tmp_result is used for storing the identified new term results and transmitting them to the verification module C for verification; therefore, the new term results in tmp_result are also called candidate new term results, or new term results to be verified;
step B2: will SijForming a candidate new term by the continuous longest word with the part of speech marked as a, b, j, n, m and q, and marking as NewTerm; the term "continuous longest" means at SijThe two ends of the middle NewTerm have no words with the parts of speech a, b, j and n;
step B3: if in Sij the part of speech of the word W immediately following NewTerm is k, i.e., W may be a suffix of NewTerm, then set NewTerm to the splice of NewTerm and W;
step B4: if in Sij the part of speech of the word W immediately preceding NewTerm is h, i.e., W may be a prefix of NewTerm, then set NewTerm to the splice of W and NewTerm;
step B5: put (NewTerm, Ti, Sij) into tmp_result;
the third step: the verification module C verifies the identified new term;
the verification module C mainly adopts a multi-source verification method and a special verification method to verify the new terms in tmp_result generated by the new term identification module B, and the verified new terms are put into the set result; the method of the verification module C is as follows:
step C1: set result to empty;
step C2: for each triple (NewTerm, Ti, Sij) in tmp_result, circularly perform the following steps C3, C4 and C5;
step C3: if there exists in tmp_result a triple (NewTerm, Ti′, Sij′) such that Ti and Ti′ are different, i.e., NewTerm appears in two different texts in TCorpus, then NewTerm is placed in result; otherwise, step C4 is executed;
as described in step C3, although NewTerm is identified as a candidate new term in sentence Sij of the text entitled Ti, NewTerm is not necessarily a correct new term; however, if NewTerm is also identified in sentence Sij′ of the text entitled Ti′, the likelihood that NewTerm is a correct new term is greatly increased;
step C4: if a seed term Term exists in the seed dictionary such that the weighted similarity wsim(NewTerm, Term) of NewTerm and Term is greater than α, wherein α ∈ [0, 1] is a threshold, then NewTerm is put into result; otherwise, step C5 is executed;
to define the weighted similarity wsim(NewTerm, Term) of two terms, we first give the calculation method of the function 2gram; for a non-empty character string Sent = C1C2…Ci-1Ci…CK-1CK, where each Ci is a Chinese character, digit or English letter, we introduce head and tail markers to obtain the marked string $C1C2…Ci-1Ci…CK-1CK$; 2gram(Sent) is the set of pairs of consecutive characters taken from left to right in the marked string, i.e., 2gram(Sent) = {$C1, C1C2, …, CK-1CK, CK$};
It should be noted that the importance of each element of 2gram(Sent) is not the same: when Ci-1Ci is a Chinese word, Ci-1Ci plays a greater role in 2gram(Sent); to reflect the importance of each element of 2gram(Sent), the intersection Interset(S1, S2) is improved by introducing a new cardinality, called the weighted intersection cardinality, WInterset(S1, S2); the calculation method is as follows: for two given sets S1 and S2:
(1) WInterset(S1, S2) = 0;
(2) for each element e of Interset(S1, S2): if e is a Chinese word, then WInterset(S1, S2) = WInterset(S1, S2) + 1.2, i.e., WInterset(S1, S2) accumulates 1.2 instead of 1; otherwise, WInterset(S1, S2) = WInterset(S1, S2) + 1, i.e., WInterset(S1, S2) accumulates 1;
wsim(NewTerm, Term) is calculated as follows:
(1) if NewTerm has the same prefix and suffix as Term, then wsim(NewTerm, Term) = 1;
(2) if NewTerm does not have the same prefix and suffix as Term, then wsim(NewTerm, Term) is computed from the weighted intersection cardinality of 2gram(NewTerm) and 2gram(Term);
wherein the intersection, union and cardinality of sets are defined as follows: given two sets S1 and S2, their intersection is denoted Interset(S1, S2), their union is denoted Union(S1, S2), and the cardinality of S1 is denoted |S1|;
Step C5: verify NewTerm using its context in Sij; the specific method is as follows: when the part of speech of the participle immediately before NewTerm in Sij is one of c, d, p, r, u, z, and the part of speech of the participle immediately after NewTerm in Sij is one of c, d, p, r, u, z, then NewTerm is a correct new term and is added to result; otherwise it is discarded, i.e., not added to result;
step C6: output result as the final result.
Advantageous effects: the invention provides a new term identification method and system with high precision and high recall. In tests on up to 2 GB of web page corpora, covering various industries and professional fields in addition to the news field, the recognition accuracy of new terms was 93.8%. The invention therefore achieves good recognition performance, meets the goal of practical application, and lays a solid foundation for many applications such as dictionary compilation, word segmentation, text classification, public opinion analysis and advertisement analysis.
Detailed Description
In order to be able to explain the invention more clearly, the following terms are defined and explained below:
(1) Word length: the length of a word. A Chinese word is made up of one or more Chinese characters, and the length of a word equals the number of characters it contains. A word of length 1 is called a single-character word, a word of length 2 a two-character word, a word of length 3 a three-character word, and so on.
(2) Multi-character word: a Chinese word with a word length of 3 or more and a definite meaning is called a multi-character word, such as "central spirit" and "positive energy", where the former is a four-character word and the latter a three-character word.
(3) Dictionary: a word list comprising a group of words, where each word may be a single-character word (word length 1), a two-character word (word length 2) or a multi-character word.
(4) New term: given a term dictionary, a term that does not appear in the dictionary is called a new term, also known as an unregistered term.
(5) Text word sequence and sentence word sequence: segmenting a text produces a word sequence, called a text word sequence, or simply a word sequence. When the text is a single sentence, we also speak of the sentence word sequence. Where the context is clear, the sentence word sequence is likewise simply called a word sequence.
(6) ICTCLAS system: a free, open-source word segmentation system. The system takes a text as input and outputs the word segmentation sequence of the text. The download website of the ICTCLAS system is: http://ictclas.nlpir.org. After word segmentation, each segment is tagged with a part of speech, where a denotes an adjective, b a distinguishing word, c a conjunction, d an adverb, h a prefix, j an abbreviation, k a suffix, m a numeral, n a noun, p a preposition, q a quantifier, r a pronoun, u an auxiliary word, z a status word, and so on.
(7) Text: whether a short sentence or a long article, we refer to it uniformly as text. For simplicity, unless otherwise specified, all web pages mentioned in this specification refer to the plain text obtained after removing HTML tags, CSS code (Cascading Style Sheets), DIV code (the positioning technique in Cascading Style Sheets, from "division"), JS code (JavaScript) and the like.
(8) Character string splicing: given any two character strings V1 and V2, the splice of V1 and V2 is the new character string obtained by seamlessly connecting them together, denoted V1⊕V2. For example, the splice of "note" and "book" is "notebook".
(9) Intersection, union, cardinality of sets: given two sets S1 and S2, their intersection, denoted Interset(S1, S2), is the set of those elements that appear in S1 and also appear in S2. For example, if S1 = {notebook, computer} and S2 = {notebook, computer, 4G}, then Interset(S1, S2) = {notebook, computer}. The set of elements that appear in S1 or in S2 is the union of S1 and S2, denoted Union(S1, S2). For example, if S1 = {notebook, computer} and S2 = {notebook, computer, 4G}, then Union(S1, S2) = {notebook, computer, 4G}. The number of elements in a set is the cardinality of the set; the cardinality of S1 is written |S1|. For example, if S1 = {notebook, computer}, then |S1| = 2.
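The worked example above can be reproduced directly with Python's built-in set type; this is only an illustration of the three set operations, with the names S1 and S2 taken from the text:

```python
# Reproduce the worked example: S1 = {notebook, computer}, S2 = {notebook, computer, 4G}.
S1 = {"notebook", "computer"}
S2 = {"notebook", "computer", "4G"}

interset = S1 & S2   # Interset(S1, S2): elements appearing in both sets
union = S1 | S2      # Union(S1, S2): elements appearing in either set
card = len(S1)       # |S1|: the cardinality of S1
```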
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention provides a new term recognition method which takes a text set RCorpus as input (called the input text library) and works through the following three modules:
Module A: segment each text in the input text library RCorpus to form text word sequences.
Module B: perform new term recognition on each text word sequence in the segmented text library TCorpus.
Module C: verify the identified new terms.
Module A: segment each text in the input text library RCorpus to form text word sequences.
The open-source ICTCLAS system is adopted to perform word segmentation on each input text D in RCorpus, and the word segmentation result is T' = W1/pos1 W2/pos2 … Wi/posi … Wn/posn, wherein each Wi is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and posi is its corresponding part of speech.
In word segmentation, part-of-speech tagging is common practice in computing. Common parts of speech include n (noun), v (verb), a (adjective), d (adverb), p (preposition), and so on. For example, the word segmentation result of the sentence "headquarters organization cadres learn central spirit" is: "headquarters/n organization/n cadres/n learn/v central/n spirit/n".
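As an illustration only (the real ICTCLAS toolkit has its own API; the tuple representation below is an assumption made for this sketch), a segmented sentence can be held in memory as (word, part-of-speech) pairs mirroring the Wi/posi notation:

```python
# Hypothetical in-memory form of the example segmentation
# "headquarters/n organization/n cadres/n learn/v central/n spirit/n".
tokens = [
    ("headquarters", "n"),
    ("organization", "n"),
    ("cadres", "n"),
    ("learn", "v"),
    ("central", "n"),
    ("spirit", "n"),
]

# Parts of speech can then be filtered directly, e.g. collecting all nouns:
nouns = [word for word, pos in tokens if pos == "n"]
```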
To illustrate the distinction, each text in RCorpus is segmented, and the resulting text is denoted as TCorpus.
Module B: perform new term recognition on each text word sequence in the segmented text library TCorpus.
The specific implementation of module B is as follows:
Suppose the current text to be recognized is Di, Ti is its title, and Sij is the current j-th sentence of Di to be identified. Sij is processed by the following steps to form candidate new term results (also called new term results to be verified), which are stored in the set tmp_result:
Step B1: tmp_result is set to empty.
tmp_result is used for storing the identified new term results and transmitting them to the verification module C for verification. Thus, the new term results in tmp_result are also referred to as candidate new term results (also called new term results to be verified).
Step B2: will SijThe continuous longest word in the list with part of speech marked as a, b, j, n, m and q forms a candidate new term, which is marked as NewTerm.
Described in step B2 "Continuous longest "means at SijThe two ends of middle NewTerm have no words with the parts of speech a, b, j and n.
Step B3: if in Sij the part of speech of the word W immediately following NewTerm is k (i.e., W may be a suffix of NewTerm), then set NewTerm to the splice of NewTerm and W.
Step B4: if in Sij the part of speech of the word W immediately preceding NewTerm is h (i.e., W may be a prefix of NewTerm), then set NewTerm to the splice of W and NewTerm.
Step B5: will (NewTerm, T)i,Sij) Put into tmp _ result.
Module C: verify the identified new terms.
This module mainly adopts a multi-source verification method and a special verification method to verify the terms in tmp_result generated by module B, and the verified new terms are put into the set result. The method of module C is as follows:
Step C1: result is set to empty.
Step C2: for each triple (NewTerm, Ti, Sij) in tmp_result, loop through the following steps C3, C4 and C5.
Step C3: if there exists in tmp_result a triple (NewTerm, Ti′, Sij′) such that Ti and Ti′ are different (i.e., NewTerm appears in two different texts in TCorpus), then NewTerm is placed in result; otherwise, step C4 is performed.
As described in step C3, although NewTerm is identified as a candidate new term in sentence Sij of the text entitled Ti, NewTerm is not necessarily a correct new term. However, if NewTerm is also identified in sentence Sij′ of the text entitled Ti′, the likelihood that NewTerm is a correct new term is greatly increased. (Of course, if a third different text were used to further verify NewTerm, NewTerm would be even more likely to be correct, but this would reduce the breadth of new term recognition; experiments have shown that verification against a third different text is not required.)
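The multi-source check of step C3 amounts to grouping candidates by term and asking whether a term was seen under at least two different titles; a minimal sketch:

```python
def multi_source_verified(tmp_result):
    """Return the set of NewTerms that occur under two or more different
    titles Ti, i.e. that appear in two different texts of TCorpus."""
    titles_seen = {}
    for term, title, _sentence in tmp_result:
        titles_seen.setdefault(term, set()).add(title)   # record titles per term
    return {term for term, titles in titles_seen.items() if len(titles) >= 2}
```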
Step C4: if a seed term Term exists in the seed dictionary such that the weighted similarity wsim(NewTerm, Term) > α (where α ∈ [0, 1] is a threshold), then put NewTerm into result; otherwise, execute step C5.
To define the weighted similarity wsim(NewTerm, Term) of two terms, we first give the calculation method of the function 2gram. For a non-empty character string Sent = C1C2…Ci-1Ci…CK-1CK, where each Ci is a Chinese character, digit or English letter, we introduce head and tail markers to obtain the marked string $C1C2…Ci-1Ci…CK-1CK$. 2gram(Sent) is the set of pairs of consecutive characters taken from left to right in the marked string, i.e., 2gram(Sent) = {$C1, C1C2, …, CK-1CK, CK$}.
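The 2gram function is straightforward to implement; the sketch below uses "$" as both the head and the tail marker, as in the definition above:

```python
def two_gram(sent):
    """Return the set of consecutive character pairs of sent after
    adding the head marker '$' and the tail marker '$'."""
    marked = "$" + sent + "$"
    return {marked[i:i + 2] for i in range(len(marked) - 1)}
```

For instance, two_gram("abc") yields the set {"$a", "ab", "bc", "c$"}.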
It should be noted that the importance of each element of 2gram(Sent) is not the same: when Ci-1Ci is a Chinese word, Ci-1Ci plays a greater role in 2gram(Sent). To reflect the importance of each element of 2gram(Sent), the present invention improves on the intersection Interset(S1, S2) defined above by introducing a new cardinality, called the weighted intersection cardinality, WInterset(S1, S2). The calculation method is as follows: for two given sets S1 and S2:
(1) WInterset(S1, S2) = 0;
(2) for each element e of Interset(S1, S2): if e is a Chinese word, then WInterset(S1, S2) = WInterset(S1, S2) + 1.2, i.e., WInterset(S1, S2) accumulates 1.2 instead of 1; otherwise, WInterset(S1, S2) = WInterset(S1, S2) + 1, i.e., WInterset(S1, S2) accumulates 1;
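The weighted intersection cardinality can be sketched as below. The test "e is a Chinese word" is delegated to a caller-supplied lexicon set, since the source does not specify how that membership is decided:

```python
def winterset(s1, s2, lexicon):
    """WInterset(S1, S2): shared elements that are dictionary words
    contribute 1.2 each; all other shared elements contribute 1."""
    total = 0.0
    for e in s1 & s2:                        # iterate over Interset(S1, S2)
        total += 1.2 if e in lexicon else 1.0
    return total
```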
wsim(NewTerm, Term) is calculated as follows:
(1) if NewTerm has the same prefix and suffix as Term, then wsim(NewTerm, Term) = 1;
(2) if NewTerm does not have the same prefix and suffix as Term, then wsim(NewTerm, Term) is computed from the weighted intersection cardinality of 2gram(NewTerm) and 2gram(Term).
for the first case of the above formula, we give an example for explanation. Let NewTerm be "the eighteenth fourth meeting in total", and Term be "the lothane meeting in total". At this time wsim ("eighteenth conference in total, fourth conference in total, or" conference in total ") is 1 because Term and NewTerm have a common prefix and suffix.
Step C5: verify NewTerm using its context in Sij. The specific method is as follows: when the part of speech of the participle immediately before NewTerm in Sij is one of c, d, p, r, u, z, and the part of speech of the participle immediately after NewTerm in Sij is one of c, d, p, r, u, z, then NewTerm is a correct new term and is added to result; otherwise it is discarded, i.e., not added to result.
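Step C5 reduces to inspecting the part-of-speech tags of the tokens on either side of NewTerm. A sketch, assuming the sentence is available as (word, pos) pairs and that NewTerm occupies tokens[start:end] (the slice convention is an assumption of this illustration):

```python
CONTEXT_POS = {"c", "d", "p", "r", "u", "z"}  # function-word tags accepted as context

def context_verified(tokens, start, end):
    """Return True when the token just before tokens[start] and the token
    at tokens[end] both carry one of the accepted context POS tags."""
    if start == 0 or end >= len(tokens):      # no word on one side: reject
        return False
    return tokens[start - 1][1] in CONTEXT_POS and tokens[end][1] in CONTEXT_POS
```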
Step C6: output result as the final result.
Experimental effects
When calculating the similarity wsim(NewTerm, Term) between new terms and seed terms in the dictionary, repeated experiments show that the results are best when α = 0.6; the recognition precision for multi-character new terms reaches 93.8%.