CN106815187B - New term recognition method - Google Patents

New term recognition method

Info

Publication number
CN106815187B
CN106815187B (application CN201510845390.7A)
Authority
CN
China
Prior art keywords
newterm
word
result
term
new term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510845390.7A
Other languages
Chinese (zh)
Other versions
CN106815187A (en)
Inventor
符建辉 (Fu Jianhui)
王卫明 (Wang Weimin)
曹阳 (Cao Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Original Assignee
Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Guoli Zhenjiang Intelligent Technology Co ltd filed Critical Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Priority to CN201510845390.7A priority Critical patent/CN106815187B/en
Publication of CN106815187A publication Critical patent/CN106815187A/en
Application granted granted Critical
Publication of CN106815187B publication Critical patent/CN106815187B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to a new term recognition system and method. The system comprises a module A, which segments each document in an input text library RCorpus to form text word sequences; a new term recognition module B, which operates on each document word sequence in the segmented text library TCorpus; and a verification module C for the identified new terms. The method comprises the following steps. First, the text word sequence module A segments each text in the input text library RCorpus to form a text word sequence. Second, the new term recognition module B performs new term recognition on each text word sequence in the segmented text library TCorpus. Third, the verification module C verifies the identified new terms. The invention provides a new term identification method and system with high precision and high recall; the recognition accuracy of new terms is 93.8%.

Description

New term recognition method
Technical Field
The invention relates to the field of Chinese natural language processing and automatic identification of new Chinese words, in particular to an automatic identification method of new terms.
Background
With the rapid development of the internet, new terms emerge endlessly, which brings great difficulty to natural language processing applications, automatic application software (such as word segmentation systems), and dictionary compilation.
Research into the identification of new terms has been underway for many years. Existing methods fall into three types. The first is the statistics-based approach. For example, Kenneth Ward Church, Béatrice Daille, and others use mutual information to extract fixed combinations and collocations of words: frequently co-occurring adjacent character combinations are assumed to be terms, and mutual information measures the degree of co-occurrence of a phrase. As another example, Ted Dunning and others apply log-likelihood statistics. Statistical methods also include conditional random field, hidden Markov, and maximum entropy methods. The second is the approach based on linguistic features and lexical patterns; for example, Liu Lei, Wang Shi, and others combine multiple features with lexical and syntactic patterns to obtain new professional terms. The third integrates the first two approaches to overcome their respective disadvantages.
However, detailed experimental analysis reveals two problems with the above methods.
Problem 1: the precision of new term identification. A purely statistical approach can identify many new terms, but it usually introduces a large number of errors; that is, Chinese strings that are not new terms are mistakenly accepted as new terms. For example, for the phrase "headquarters organize cadres to learn the central spirit", a statistical method easily mis-recognizes "headquarters organize cadres", "cadres learn", and the like as new terms, though none of them actually is one. Methods based on linguistic features, on the other hand, achieve high recognition precision but a limited recognition scope. This is one of the key problems the present invention solves.
Problem 2: the breadth of new term identification. Because words can combine in many ways, automatic recognition easily misses meaningful new terms. How to improve recognition breadth is therefore an important issue, and is also one of the key problems the present invention solves.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: the new term identification precision problem and the identification breadth problem.
In response to problem 1, the present invention introduces a seed term dictionary technique that not only utilizes a seed dictionary for the recognition of new terms, but also uses it to validate newly obtained new terms.
Aiming at problem 2, the invention introduces a multi-source iterative new term identification technology. First, a multi-source analysis method compares and verifies candidates across multiple texts, improving the recognition precision of new terms; meanwhile, newly obtained terms are added to the seed term dictionary and the process is iterated, so that more new terms are obtained.
In order to achieve the purpose, the invention provides the following technical scheme:
a new term recognition method is characterized in that: the method comprises the following steps:
the first step is as follows: the text word sequence module A carries out word segmentation on each text in an input text library RCorpus to form a text word sequence;
the open-source ICTCLAS system is adopted to segment each input text D in RCorpus; the segmentation result is T′ = W1/pos1 W2/pos2 … Wi/posi … Wn/posn, wherein each Wi is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and posi is its corresponding part of speech;
to mark the distinction, the text library generated by segmenting every text in RCorpus is denoted TCorpus;
the second step is that: the new term recognition module B carries out new term recognition on each text word sequence in the segmented text library TCorpus;
the current text to be recognized is Di, Ti is its title, and Sij is the current j-th sentence of Di to be processed; Sij is processed by the following steps, forming candidate new term results that are stored in the set tmp_result:
step B1: set tmp_result to null;
tmp_result stores the identified new term results and passes them to the verification module C for verification; the new term results in tmp_result are therefore also called candidate new term results, or new term results to be verified;
step B2: form a candidate new term, denoted NewTerm, from the continuous longest words in Sij whose parts of speech are a, b, j, n, m, and q; "continuous longest" means that neither end of NewTerm in Sij adjoins a word whose part of speech is a, b, j, or n;
step B3: if the part of speech of the word W immediately following NewTerm in Sij is k, i.e. W may be a suffix of NewTerm, then set NewTerm = NewTerm ⊕ W (string splicing);
step B4: if the part of speech of the word W immediately preceding NewTerm in Sij is h, i.e. W may be a prefix of NewTerm, then set NewTerm = W ⊕ NewTerm;
step B5: put (NewTerm, Ti, Sij) into tmp_result;
the third step: the verification module C verifies the identified new term;
the verification module C adopts a multi-source verification method and a dedicated verification method to verify the new terms in tmp_result generated by the new term recognition module B; verified new terms are put into the set result; the method of the verification module C is as follows:
step C1: setting result to null;
step C2: for each (NewTerm, Ti, Sij) in tmp_result, perform the following steps C3, C4, and C5 in a loop;
step C3: if tmp_result contains (NewTerm, Ti′, Sij′) with Ti ≠ Ti′, i.e. NewTerm appears in two different texts in TCorpus, then put NewTerm into result; otherwise, execute step C4;
as to step C3: although NewTerm is identified as a candidate new term in sentence Sij of the text entitled Ti, NewTerm is not necessarily a correct new term; however, if NewTerm is also identified in sentence Sij′ of the text entitled Ti′, the likelihood that NewTerm is a correct new term greatly increases;
step C4: if there exists a seed term Term in the seed dictionary such that the weighted similarity wsim(NewTerm, Term) > α, where the threshold α ∈ [0, 1], then put NewTerm into result; otherwise, execute step C5;
to define the weighted similarity wsim(NewTerm, Term) of two terms, we first define the function 2gram; for a non-empty character string Sent = C1C2…Ci-1Ci…CK-1CK, where each Ci is a Chinese character, digit, or English letter, we add head and tail markers to obtain $C1C2…Ci-1Ci…CK-1CK$; 2gram(Sent) is the set of every two consecutive characters of the marked string read from left to right, i.e. 2gram(Sent) = {$C1, C1C2, …, CK-1CK, CK$};
it should be noted that the elements of 2gram(Sent) are not all equally important: when Ci-1Ci is a Chinese word, Ci-1Ci plays a larger role in 2gram(Sent); to reflect the importance of each element, Interset(S1, S2) is improved by introducing a new cardinality, called the weighted intersection cardinality WInterset(S1, S2), computed as follows for two given sets S1 and S2:
(1) initialize WInterset(S1, S2) = 0;
(2) for each element e of Interset(S1, S2): if e is a Chinese word, then WInterset(S1, S2) = WInterset(S1, S2) + 1.2, i.e. WInterset accumulates 1.2 instead of 1; otherwise WInterset(S1, S2) = WInterset(S1, S2) + 1, i.e. WInterset accumulates 1;
wsim(NewTerm, Term) is computed as follows:
(1) if NewTerm and Term have the same prefix and suffix, then wsim(NewTerm, Term) = 1;
(2) if NewTerm and Term do not have the same prefix and suffix, then
wsim(NewTerm, Term) = WInterset(2gram(NewTerm), 2gram(Term)) / |Union(2gram(NewTerm), 2gram(Term))|,
wherein the set operations are defined as follows: given two sets S1 and S2, their intersection is denoted Interset(S1, S2), their union is denoted Union(S1, S2), and |S| denotes the cardinality of a set S;
Step C5: using NewTerm at SijVerifying the context of the user; the specific method comprises the following steps: when NewTerm is at SijThe part of speech of the preceding participle is one of c, d, p, r, u, z, and NewTerm is at SijWhen the part of speech of the following participles is one of c, d, p, r, u and z, NewTerm is a correct new term and is added into result; otherwise, abandoning, namely not adding into result;
step C6: output result as the final result.
Beneficial effects: the invention provides a new term identification method and system with high precision and high recall. In tests on up to 2 GB of web page corpora covering the news domain as well as various industries and professional fields, the recognition accuracy of new terms is 93.8%. The invention therefore achieves good recognition performance, reaches the level of practical application, and lays a solid foundation for applications such as dictionary compilation, word segmentation, text classification, public opinion analysis, and advertisement analysis.
Drawings
FIG. 1 is a workflow diagram of a new term identification method of the present invention;
FIG. 2 is a schematic diagram of the method of operation of module B;
FIG. 3 is a schematic diagram of the operation method of module C.
Detailed Description
In order to be able to explain the invention more clearly, the following terms are defined and explained below:
(1) Word length: the length of a word. A Chinese word is made up of one or more Chinese characters, and its length equals the number of characters it contains. A word of length 1 is called a single-character word, a word of length 2 a two-character word, a word of length 3 a three-character word, and so on.
(2) Multi-character words: a word whose length is 3 or more and which carries a definite meaning is called a multi-character word, such as "central spirit" and "positive energy"; in the original Chinese, the former is a four-character word and the latter a three-character word.
(3) Dictionary: a list comprising a group of words, which may be single-character words (length 1), two-character words (length 2), or multi-character words.
(4) New term: given a term dictionary, a term that does not appear in the dictionary is called a new term, also known as an unregistered term.
(5) Text word sequence and sentence word sequence: the word segmentation is performed on a text to form a word sequence, which is called a text word sequence, or simply a word sequence. When the text is a sentence, we also sometimes refer to the sequence of words of the sentence. In the clear context, we also refer to the sequence of words of the sentence simply as a sequence of words.
(6) ICTCLAS system: a free, open-source word segmentation system that takes a text as input and outputs its word segmentation sequence. The download site of the ICTCLAS system is http://ictclas.nlpir.org. After segmentation, each word is tagged with a part of speech, where a denotes an adjective, b a distinguishing word, c a conjunction, d an adverb, h a prefix, j an abbreviation, k a suffix, m a numeral, n a noun, p a preposition, q a quantifier, r a pronoun, u a particle, z a status word, and so on.
(7) Text: whether a short sentence or a long article, we call both text. For simplicity, unless otherwise noted, all web pages mentioned in this specification refer to the plain text obtained after removing HTML tags, CSS code (Cascading Style Sheets), DIV code (Division elements used for positioning within Cascading Style Sheets), JS code (JavaScript), and the like.
(8) Character string splicing: given any two character strings V1 and V2, their splice is the new string obtained by connecting them seamlessly, denoted V1 ⊕ V2. For example, if V1 = "note" and V2 = "book", then V1 ⊕ V2 = "notebook".
(9) Intersection, union, and cardinality of sets: given two sets S1 and S2, their intersection, denoted Interset(S1, S2), is the set of elements that appear in both S1 and S2. For example, if S1 = {notebook, computer} and S2 = {notebook, computer, 4G}, then Interset(S1, S2) = {notebook, computer}. The set of elements appearing in S1 or in S2 is the union of S1 and S2, denoted Union(S1, S2); for S1 = {notebook, computer} and S2 = {4G}, Union(S1, S2) = {notebook, computer, 4G}. The number of elements in a set is its cardinality; the cardinality of S1 is written |S1|, e.g. for S1 = {notebook, computer}, |S1| = 2.
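The set operations defined in (9) can be sketched in Python. This is a minimal illustration only; the names interset and union mirror the specification's Interset and Union and are not part of the patent itself.

```python
# Sketch of the set operations used in the specification:
# Interset (intersection), Union, and |S| (cardinality).
def interset(s1, s2):
    """Elements appearing in both s1 and s2."""
    return s1 & s2

def union(s1, s2):
    """Elements appearing in s1 or in s2."""
    return s1 | s2

s1 = {"notebook", "computer"}
s2 = {"notebook", "computer", "4G"}
print(interset(s1, s2))  # {'notebook', 'computer'} (order may vary)
print(len(s1))           # 2, i.e. the cardinality |S1|
```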
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention provides a new term recognition method, which takes a text set RCorpus as input (called an input text library) and adopts the following three modules to work:
a module A: and segmenting each text in the input text library RCorpus to form a text word sequence.
And a module B: and carrying out new term recognition on each text word sequence in the segmented text library TCorpus.
And a module C: the identified new term is validated.
A module A: and segmenting each text in the input text library RCorpus to form a text word sequence.
The open-source ICTCLAS system is adopted to segment each input text D in RCorpus; the segmentation result is T′ = W1/pos1 W2/pos2 … Wi/posi … Wn/posn, wherein each Wi is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and posi is its corresponding part of speech.
In word segmentation, part-of-speech tagging is common practice. Common parts of speech include n (noun), v (verb), a (adjective), d (adverb), p (preposition), and so on. For example, the segmentation result of the sentence "headquarters organize cadres to learn the central spirit" is: "headquarters/n organization/n cadres/n learn/v central/n spirit/n".
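The Wi/posi word sequence notation can be illustrated with a short sketch. The list-of-pairs data shape below is an assumption for illustration, not the ICTCLAS output format itself.

```python
# Hypothetical representation of a segmented sentence: a list of
# (word, pos) pairs mirroring the "Wi/posi" notation in the text.
sentence = [
    ("headquarters", "n"),
    ("organization", "n"),
    ("cadres", "n"),
    ("learn", "v"),
    ("central", "n"),
    ("spirit", "n"),
]

# Render the pairs back into the W/pos string form shown above.
rendered = " ".join(f"{word}/{pos}" for word, pos in sentence)
print(rendered)
# headquarters/n organization/n cadres/n learn/v central/n spirit/n
```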
To mark the distinction, the text library generated by segmenting every text in RCorpus is denoted TCorpus.
And a module B: and carrying out new term recognition on each text word sequence in the segmented text library TCorpus.
The specific implementation of module B is as follows:
suppose that the current text to be recognized is Di,TiFor its title, SijIs DiThe current j-th sentence to be identified. To SijThe following steps are performed to form candidate new term results (also called new term results to be verified), which are stored in the set tmp _ result:
step B1: tmp _ result is set to null.
tmp _ result is used for storing the result of the identified new term and transmitting the result to the verification module C for verification. Thus, the new term result in tmp _ result is also referred to as a candidate new term result (also referred to as a new term result to be verified).
Step B2: will SijThe continuous longest word in the list with part of speech marked as a, b, j, n, m and q forms a candidate new term, which is marked as NewTerm.
Described in step B2 "Continuous longest "means at SijThe two ends of middle NewTerm have no words with the parts of speech a, b, j and n.
Step B3: if the part of speech of the word W immediately following NewTerm in Sij is k (i.e., W may be a suffix of NewTerm), then set NewTerm = NewTerm ⊕ W.
Step B4: if the part of speech of the word W immediately preceding NewTerm in Sij is h (i.e., W may be a prefix of NewTerm), then set NewTerm = W ⊕ NewTerm.
Step B5: put (NewTerm, Ti, Sij) into tmp_result.
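Steps B1 to B5 can be sketched as follows. This is a minimal illustration under stated assumptions: the sentence is given as a list of (word, pos) pairs, the "longest run" end condition is simplified to the same core part-of-speech set, and extract_candidates is a name introduced here, not taken from the patent.

```python
# Minimal sketch of module B (steps B1-B5). Candidate new terms are the
# longest runs of words whose POS is in CORE = {a, b, j, n, m, q}; a
# following k-word (suffix) or preceding h-word (prefix) is spliced on,
# as in steps B3/B4.
CORE = {"a", "b", "j", "n", "m", "q"}

def extract_candidates(words, title, sentence_id):
    tmp_result = []                       # step B1: start empty
    i = 0
    while i < len(words):
        if words[i][1] in CORE:
            j = i
            while j < len(words) and words[j][1] in CORE:
                j += 1                    # step B2: longest run of core POS
            term = "".join(w for w, _ in words[i:j])
            if j < len(words) and words[j][1] == "k":
                term = term + words[j][0]         # step B3: splice suffix
            if i > 0 and words[i - 1][1] == "h":
                term = words[i - 1][0] + term     # step B4: splice prefix
            tmp_result.append((term, title, sentence_id))  # step B5
            i = j
        else:
            i += 1
    return tmp_result
```

Words are joined without separators because Chinese text has no spaces between words.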
And a module C: the identified new term is validated.
This module adopts a multi-source verification method and a dedicated verification method to verify the terms in tmp_result generated by module B; verified new terms are put into the set result. The method of module C is as follows:
step C1: result is set to null.
Step C2: for each (NewTerm, Ti, Sij) in tmp_result, loop over the following steps C3, C4, and C5.
Step C3: if tmp_result contains (NewTerm, Ti′, Sij′) with Ti ≠ Ti′ (i.e., NewTerm appears in two different texts in TCorpus), then put NewTerm into result; otherwise, perform step C4.
As to step C3: although NewTerm is identified as a candidate new term in sentence Sij of the text entitled Ti, it is not necessarily a correct new term. However, if NewTerm is also identified in sentence Sij′ of the text entitled Ti′, the likelihood that it is a correct new term greatly increases. (Of course, a third different text could further assist verification, but this would reduce the breadth of new term recognition; experiments show that verification against a third text is not required.)
Step C4: if there exists a seed term Term in the seed dictionary such that the weighted similarity wsim(NewTerm, Term) > α (the threshold α ∈ [0, 1]), then put NewTerm into result; otherwise, perform step C5.
To define the weighted similarity wsim(NewTerm, Term) of two terms, we first define the function 2gram. For a non-empty character string Sent = C1C2…Ci-1Ci…CK-1CK, where each Ci is a Chinese character, digit, or English letter, we add head and tail markers to obtain $C1C2…Ci-1Ci…CK-1CK$. 2gram(Sent) is the set of every two consecutive characters of the marked string read from left to right, i.e. 2gram(Sent) = {$C1, C1C2, …, CK-1CK, CK$}.
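The 2gram function can be sketched directly. The name two_gram is introduced here for illustration; the text treats 2gram(Sent) as a set, which the sketch follows.

```python
# Sketch of the 2gram function: add head and tail markers "$" to the
# string, then collect every pair of adjacent characters into a set.
def two_gram(sent):
    marked = "$" + sent + "$"
    return {marked[i:i + 2] for i in range(len(marked) - 1)}

# two_gram("4G") yields the set {"$4", "4G", "G$"}
```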
It should be noted that the elements of 2gram(Sent) are not all equally important: when Ci-1Ci is a Chinese word, Ci-1Ci plays a larger role in 2gram(Sent). To reflect the importance of each element, the present invention improves the Interset(S1, S2) defined above by introducing a new cardinality, called the weighted intersection cardinality WInterset(S1, S2), computed as follows for two given sets S1 and S2:
(1) initialize WInterset(S1, S2) = 0;
(2) for each element e of Interset(S1, S2): if e is a Chinese word, then WInterset(S1, S2) = WInterset(S1, S2) + 1.2 (i.e., WInterset accumulates 1.2 instead of 1); otherwise, if e is not a Chinese word, WInterset(S1, S2) = WInterset(S1, S2) + 1 (i.e., WInterset accumulates 1).
wsim(NewTerm, Term) is computed as follows:
(1) if NewTerm and Term have the same prefix and suffix, then wsim(NewTerm, Term) = 1;
(2) if NewTerm and Term do not have the same prefix and suffix, then
wsim(NewTerm, Term) = WInterset(2gram(NewTerm), 2gram(Term)) / |Union(2gram(NewTerm), 2gram(Term))|.
for the first case of the above formula, we give an example for explanation. Let NewTerm be "the eighteenth fourth meeting in total", and Term be "the lothane meeting in total". At this time wsim ("eighteenth conference in total, fourth conference in total, or" conference in total ") is 1 because Term and NewTerm have a common prefix and suffix.
Step C5: verify NewTerm using its context in Sij. Specifically, when the part of speech of the word immediately preceding NewTerm in Sij is one of c, d, p, r, u, z, and the part of speech of the word immediately following NewTerm in Sij is also one of c, d, p, r, u, z, NewTerm is accepted as a correct new term and added to result; otherwise it is discarded, i.e., not added to result.
Step C6: output result as the final result.
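Steps C1 to C6 can be sketched as one function. This is illustrative only: verify, context_pos, and seed_dict are names introduced here; the multi-source check of step C3 is implemented by counting distinct titles per term; and the similarity function is passed in as a parameter.

```python
# Compact sketch of module C (steps C1-C6). tmp_result holds
# (term, title, sentence) triples; context_pos(term, sentence) is assumed
# to return the POS tags of the words immediately before and after the
# term in that sentence.
FUNC_POS = {"c", "d", "p", "r", "u", "z"}

def verify(tmp_result, seed_dict, wsim, context_pos, alpha=0.6):
    result = set()                                        # step C1
    titles = {}                                           # term -> set of titles
    for term, title, _ in tmp_result:
        titles.setdefault(term, set()).add(title)
    for term, title, sent in tmp_result:                  # step C2
        if len(titles[term]) >= 2:                        # step C3: multi-source
            result.add(term)
        elif any(wsim(term, seed) > alpha for seed in seed_dict):  # step C4
            result.add(term)
        else:                                             # step C5: context check
            before, after = context_pos(term, sent)
            if before in FUNC_POS and after in FUNC_POS:
                result.add(term)
    return result                                         # step C6
```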
Experimental results
When computing the similarity wsim(NewTerm, Term) between new terms and the seed terms in the dictionary, repeated experiments show that α = 0.6 gives the best results; the recognition precision for multi-character new terms is 93.8%.

Claims (1)

1. A new term recognition method is characterized in that: the method comprises the following steps:
the first step is as follows: the text word sequence module A carries out word segmentation on each text in an input text library RCorpus to form a text word sequence;
the open-source ICTCLAS system is adopted to segment each input text D in RCorpus; the segmentation result is T′ = W1/pos1 W2/pos2 … Wi/posi … Wn/posn, wherein each Wi is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and posi is its corresponding part of speech;
to mark the distinction, the text library generated by segmenting every text in RCorpus is denoted TCorpus;
the second step is that: the new term recognition module B carries out new term recognition on each text word sequence in the segmented text library TCorpus;
the current text to be recognized is Di, Ti is its title, and Sij is the current j-th sentence of Di to be processed; Sij is processed by the following steps, forming candidate new term results that are stored in the set tmp_result:
step B1: set tmp_result to null;
tmp_result stores the identified new term results and passes them to the verification module C for verification; the new term results in tmp_result are therefore also called candidate new term results, or new term results to be verified;
step B2: form a candidate new term, denoted NewTerm, from the continuous longest words in Sij whose parts of speech are a, b, j, n, m, and q; "continuous longest" means that neither end of NewTerm in Sij adjoins a word whose part of speech is a, b, j, or n;
step B3: if the part of speech of the word W immediately following NewTerm in Sij is k, i.e. W may be a suffix of NewTerm, then set NewTerm = NewTerm ⊕ W (string splicing);
step B4: if the part of speech of the word W immediately preceding NewTerm in Sij is h, i.e. W may be a prefix of NewTerm, then set NewTerm = W ⊕ NewTerm;
step B5: put (NewTerm, Ti, Sij) into tmp_result;
the third step: the verification module C verifies the identified new term;
the verification module C adopts a multi-source verification method and a dedicated verification method to verify the new terms in tmp_result generated by the new term recognition module B; verified new terms are put into the set result; the method of the verification module C is as follows:
step C1: setting result to null;
step C2: for each (NewTerm, Ti, Sij) in tmp_result, perform the following steps C3, C4, and C5 in a loop;
step C3: if tmp_result contains (NewTerm, Ti′, Sij′) with Ti ≠ Ti′, i.e. NewTerm appears in two different texts in TCorpus, then put NewTerm into result; otherwise, execute step C4;
as to step C3: although NewTerm is identified as a candidate new term in sentence Sij of the text entitled Ti, NewTerm is not necessarily a correct new term; however, if NewTerm is also identified in sentence Sij′ of the text entitled Ti′, the likelihood that NewTerm is a correct new term greatly increases;
step C4: if there exists a seed term Term in the seed dictionary such that the weighted similarity wsim(NewTerm, Term) > α, where the threshold α ∈ [0, 1], then put NewTerm into result; otherwise, execute step C5;
to define the weighted similarity wsim(NewTerm, Term) of two terms, we first define the function 2gram; for a non-empty character string Sent = C1C2…Ci-1Ci…CK-1CK, where each Ci is a Chinese character, digit, or English letter, we add head and tail markers to obtain $C1C2…Ci-1Ci…CK-1CK$; 2gram(Sent) is the set of every two consecutive characters of the marked string read from left to right, i.e. 2gram(Sent) = {$C1, C1C2, …, CK-1CK, CK$};
it should be noted that the elements of 2gram(Sent) are not all equally important: when Ci-1Ci is a Chinese word, Ci-1Ci plays a larger role in 2gram(Sent); to reflect the importance of each element, Interset(S1, S2) is improved by introducing a new cardinality, called the weighted intersection cardinality WInterset(S1, S2), computed as follows for two given sets S1 and S2: (1) initialize WInterset(S1, S2) = 0;
(2) for each element e of Interset(S1, S2): if e is a Chinese word, then WInterset(S1, S2) = WInterset(S1, S2) + 1.2, i.e. WInterset accumulates 1.2 instead of 1; otherwise WInterset(S1, S2) = WInterset(S1, S2) + 1, i.e. WInterset accumulates 1;
wsim(NewTerm, Term) is computed as follows:
(1) if NewTerm and Term have the same prefix and suffix, then wsim(NewTerm, Term) = 1;
(2) if NewTerm and Term do not have the same prefix and suffix, then
wsim(NewTerm, Term) = WInterset(2gram(NewTerm), 2gram(Term)) / |Union(2gram(NewTerm), 2gram(Term))|,
wherein the set operations are defined as follows: given two sets S1 and S2, their intersection is denoted Interset(S1, S2), their union is denoted Union(S1, S2), and |S| denotes the cardinality of a set S;
step C5: verify NewTerm using its context in Sij; specifically, when the part of speech of the word immediately preceding NewTerm in Sij is one of c, d, p, r, u, z, and the part of speech of the word immediately following NewTerm in Sij is also one of c, d, p, r, u, z, NewTerm is accepted as a correct new term and added to result; otherwise it is discarded, i.e. not added to result;
step C6: outputting result as the final result;
wherein a denotes an adjective, b a distinguishing word, c a conjunction, d an adverb, h a prefix, j an abbreviation, k a suffix, m a numeral, n a noun, p a preposition, q a quantifier, r a pronoun, u a particle, and z a status word.
CN201510845390.7A 2015-11-27 2015-11-27 New term recognition method Active CN106815187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510845390.7A CN106815187B (en) 2015-11-27 2015-11-27 New term recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510845390.7A CN106815187B (en) 2015-11-27 2015-11-27 New term recognition method

Publications (2)

Publication Number Publication Date
CN106815187A CN106815187A (en) 2017-06-09
CN106815187B true CN106815187B (en) 2020-04-14

Family

ID=59101945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510845390.7A Active CN106815187B (en) 2015-11-27 2015-11-27 New term recognition method

Country Status (1)

Country Link
CN (1) CN106815187B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084979B (en) * 2020-09-14 2023-07-11 武汉轻工大学 Food ingredient identification method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102693244B (en) * 2011-03-23 2015-04-01 日电(中国)有限公司 Method and device for identifying information in non-structured text
CN102955771A (en) * 2011-08-18 2013-03-06 华东师范大学 Technology and system for automatically recognizing Chinese new words in single-word-string mode and affix mode
CN102708147B (en) * 2012-03-26 2015-02-18 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LONG TERM HUMAN ACTIVITY RECOGNITION WITH AUTOMATIC ORIENTATION ESTIMATION; Blanca Florentino-Liano et al.; 2012 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING; 2012-09-26; pp. 1-6 *
基于种子扩充的专业术语识别方法研究 (Research on professional term recognition based on seed expansion); 王卫民 (Wang Weimin); 《计算机应用研究》 (Application Research of Computers); Nov. 2012; vol. 29, no. 11; pp. 4105-4107 *
术语定义抽取、聚类与术语识别研究 (Research on term definition extraction, clustering, and term recognition); 张榕 (Zhang Rong); 《中国优秀博硕士学位论文全文数据库(博士)》 (China Doctoral Dissertations Full-text Database); 2006-11-15; no. 11; pp. F084-20 *

Also Published As

Publication number Publication date
CN106815187A (en) 2017-06-09

Similar Documents

Publication Publication Date Title
US8660834B2 (en) User input classification
Chen et al. Feature embedding for dependency parsing
Jabbar et al. An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach
Bebah et al. Hybrid approaches for automatic vowelization of Arabic texts
Matteson et al. Rich character-level information for Korean morphological analysis and part-of-speech tagging
Shaalan et al. A hybrid approach for building Arabic diacritizer
Banerjee et al. Bengali named entity recognition using margin infused relaxed algorithm
Ekbal et al. Named entity recognition and transliteration in Bengali
Scholivet et al. Identification of ambiguous multiword expressions using sequence models and lexical resources
Nehar et al. Rational kernels for Arabic root extraction and text classification
López et al. Experiments on sentence boundary detection in user-generated web content
Al-Jefri et al. Context-sensitive Arabic spell checker using context words and n-gram language models
Patra et al. Part of speech (pos) tagger for kokborok
Aziz et al. Urdu spell checker: A scarce resource language
CN106815187B (en) New term recognition method
Saito et al. Multi-language named-entity recognition system based on HMM
Shamsfard et al. STeP-1: standard text preparation for Persian language
Tongtep et al. Multi-stage automatic NE and pos annotation using pattern-based and statistical-based techniques for thai corpus construction
Zhang et al. A unified framework for grammar error correction
Azmi et al. Light diacritic restoration to disambiguate homographs in modern Arabic texts
Tien et al. Vietnamese Spelling Error Detection and Correction Using BERT and N-gram Language Model
Rajendran et al. Text processing for developing unrestricted Tamil text to speech synthesis system
Mirzanezhad et al. Using morphological analyzer to statistical POS Tagging on Persian Text
Tummalapalli et al. Syllables for sentence classification in morphologically rich languages
Sharma Assigning the correct word class to Punjabi unknown words using CRF

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 212009 Zhenjiang high tech Industrial Development Zone, Jiangsu, No. 668, No. twelve road.

Applicant after: Zhongke national power (Zhenjiang) Intelligent Technology Co., Ltd.

Address before: 212009 18 building, North Tower, Twin Tower Rd 468, twelve road 468, Ding Mo Jing, Jiangsu.

Applicant before: Knowology Intelligent Technology Co., Ltd.

GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Fu Jianhui

Inventor after: Wang Weimin

Inventor after: Cao Yang

Inventor before: Fu Jianhui

Inventor before: Wang Weiming

Inventor before: Cao Yang
