CN103049501A - Chinese domain term recognition method based on mutual information and conditional random field model - Google Patents

Chinese domain term recognition method based on mutual information and conditional random field model

Info

Publication number
CN103049501A
CN103049501A, CN2012105287348A, CN201210528734A
Authority
CN
China
Prior art keywords
word
string
word string
evaluation function
random field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105287348A
Other languages
Chinese (zh)
Other versions
CN103049501B (en)
Inventor
彭琳
刘宗田
杨林楠
张立敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI UNIVERSITY
Original Assignee
SHANGHAI UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI UNIVERSITY filed Critical SHANGHAI UNIVERSITY
Priority to CN201210528734.8A
Publication of CN103049501A
Application granted
Publication of CN103049501B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese domain term recognition method based on mutual information and a conditional random field model. The method includes the following steps: (1) collecting domain text corpus and marking all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus; (2) setting candidate character strings and computing the mutual information value of each string; (3) computing the left and right information entropy of every character string; (4) defining a character string evaluation function, setting an evaluation function threshold, computing the evaluation function value of every character string to determine whether it is a word, comparing in sequence the evaluation function value of each candidate string with that of the next candidate string, and segmenting the corpus into words one by one; (5) training a conditional random field model with the conditional random field method and recognizing domain terms with the model. When the method is used to recognize terms, the data sparseness of legitimate terms is overcome, the computational load of the conditional random field is reduced, and the accuracy of Chinese domain term recognition is improved.

Description

Chinese domain term recognition method based on mutual information and conditional random field model

Technical Field

The present invention relates to a Chinese domain term recognition method based on mutual information and a conditional random field model, and belongs to the field of information technology.

Background Art

According to the definition in the national standard GB/T 15237.1-2000 "Terminology Work: Vocabulary" (《术语工作词汇》), a term is the verbal designation of a general concept in a specific professional field, that is, a word or phrase used within a subject field to denote a concept or relation of that field. Terms can be divided into general terms used in daily life and domain terms used in specific fields. General terms are mostly formed by people's living and working habits; they are not required to express concepts strictly and accurately, and their meanings are often vague. Domain terms, by contrast, are systematic and general descriptions of professional concepts; ambiguity is not allowed, and the concept expressed by each technical term must be precise and must not vary from user to user.

Domain term recognition refers to extracting professional terms from a corpus of a specific scientific or technical field. As an important part of information extraction, automatic domain term recognition is widely used in natural language processing and is important for improving the accuracy of domain text indexing and retrieval, text mining, ontology construction, text classification and clustering, latent semantic analysis, and so on. Existing domain term recognition methods for Chinese text mainly include the following:

(1) Statistical methods for Chinese domain term recognition. The main idea is to extract domain terms by exploiting the strong association between the internal components of a term together with the domain-specific distribution of the term. The general procedure of statistical methods is as follows: first, statistical or information-theoretic measures are computed and relatively reliable seed words are determined from the results; the seed set is then expanded iteratively to obtain the final domain terms. Word frequency, mean and variance are commonly used statistics, and many researchers use hypothesis-testing measures such as the t-test, chi-square test, log-likelihood ratio and pointwise mutual information. Statistical term recognition requires no syntactic or semantic information, is not restricted to a particular field, does not depend on external resources, and therefore generalizes well.

Among the statistical approaches, the mutual information algorithm is the most widely used. For example, the article "Chinese Term Extraction System Based on Mutual Information" (张锋, 许云, 侯艳, 樊孝忠, Computer Application Research, Vol. 22, No. 5, 2005, pp. 72-73, 77) discloses an automatic Chinese term extraction system. The system first computes the internal cohesion of character strings with mutual information to obtain a candidate term set; it then removes basic words from the candidate set and filters further using the prefix and suffix information of common word collocations; finally it performs lexical analysis on the remaining candidates and applies part-of-speech composition rules to obtain the final extraction result. Experiments show a precision of 72.19%, a recall of 77.98% and an F-measure of 74.97% for term extraction with the mutual information algorithm. Another example is "Term Extraction Combining C-value and Mutual Information" (梁颖红, 张文静, 张有承, Computer Application and Software, Vol. 27, No. 4, 2010, pp. 108-110), which discloses a term extraction method combining the C-value and mutual information and argues that the combined C-value parameter is advantageous for extracting long terms. Experiments show a precision of 75.7%, a recall of 68.4% and an F-measure of 71.9% for long term extraction, higher than other methods on the same corpus. However, the performance of such algorithms depends directly on the corpus size and the frequency of the candidate terms, and the data sparseness problem that some low-frequency candidates may still be legitimate terms is difficult to solve. Consequently, when the mutual information algorithm alone is used for domain term recognition, precision, recall and F-measure rarely exceed 80%, and it is difficult to achieve satisfactory recognition results.

(2) Machine-learning methods for Chinese domain term recognition. The main steps are: constructing a training corpus manually or semi-automatically, learning a model from the training corpus with some machine learning algorithm, and then applying the model to a test corpus for domain term extraction to verify the effectiveness of the algorithm. The machine learning techniques that have been used for Chinese domain term recognition mainly include decision trees, support vector machines, hidden Markov models, maximum entropy models, maximum entropy Markov models and conditional random fields. Machine-learning-based term recognition requires neither expert domain knowledge nor linguistic knowledge, is practical to implement, and can achieve good recognition or extraction results when multiple term features are considered.

At present, the conditional random field (CRF) model is the most widely used machine-learning method for Chinese domain term recognition. For example, "An Automatic Extraction Method for Terms of Traditional Chinese Medicine" (张五辈, 白宇, 王裴岩, 张桂平, Journal of Shenyang Aerospace University, Vol. 28, No. 1, 2011, pp. 72-75) discloses a CRF-based term extraction method for the field of traditional Chinese medicine. The method treats term extraction in this field as a sequence labeling problem, quantifies the distributional features of the terms as training features, trains a domain term model with a CRF toolkit, and then uses the model for term extraction. In experiments on the medical text 《名医类案》, the precision reached 83.11%, the recall 81.04% and the F-measure 82.06%. Similarly, "Research on Automatic Extraction of Military Intelligence Terminology Using CRF Technology" (贾美英, 杨炳儒, 郑德权, 杨靖, Computer Engineering and Applications, Vol. 45, No. 32, 2009, pp. 126-129) discloses a CRF-based term extraction method for the military intelligence field. The method treats domain term recognition as a sequence labeling problem, quantifies the distributional features of domain terms as training features, trains a domain term feature template with the CRF toolkit, and then uses the template for term extraction. Experiments show good recognition results for military intelligence terms, with a precision of 73.24%, a recall of 69.57% and an F-measure of 71.36%.

When the conditional random field algorithm is used for domain term recognition, the training corpus is usually annotated manually or semi-automatically. The high degree of human involvement and the heavy workload keep the amount of annotated data small, which restricts the recognition accuracy and the applicability of the algorithm. In addition, the corpus must first be segmented with a general-purpose word segmentation tool before conditional random field training and testing can be carried out, and only then can term recognition be achieved. The premise of using the conditional random field algorithm for domain term recognition is therefore that existing general-purpose segmentation tools can segment the vocabulary of the field accurately, and that domain terms are coarser in granularity than the words produced by the segmenter. However, because there is a gap between professional terminology and ordinary vocabulary, it is difficult to segment professional-domain corpora accurately with general segmentation tools. For these reasons, current mutual-information and conditional-random-field methods offer a low degree of automation and limited accuracy in domain term recognition.

Summary of the Invention

In view of the problems of the prior art described above, the object of the present invention is to provide a Chinese domain term recognition method based on mutual information and a conditional random field model. When recognizing terms, the method not only overcomes the data sparseness of legitimate terms and reduces the computational load of the conditional random field algorithm, but also improves the accuracy of Chinese domain term recognition.

In order to achieve the above object, the present invention adopts the following technical scheme:

The specific steps of the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention are as follows:

(1) Collect domain text corpus and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus;

(2) Set a character string W and compute the mutual information value of W;

(3) Compute the left and right information entropy of each character string W;

(4) Define an evaluation function of the character string W, set an evaluation function threshold, and compute the evaluation function value of each character string to determine whether W is a word; successively compare the evaluation function value of the candidate string ending at the preceding character with that of the candidate string ending at the following character, obtain the corresponding ratio for each string, compare the ratio with the evaluation function threshold, and segment the corpus into word strings one by one;

(5) Using the word itself, the part of speech and the frequency of occurrence of the word as the training features of the random field, train a domain term conditional random field model with the conditional random field method and use the model to recognize domain terms.

The mutual information value of the character string W set in step (2) above is computed with the following formula:

Assume that a domain term consists of n characters. If the character string W is a domain term, then W is composed of the characters c_1, c_2, c_3, ..., c_n, and the mutual information value of W is computed as:

    MI(W) = log2 [ f(c_1 c_2 ... c_n) / ( f(c_1) · f(c_2) · ... · f(c_n) ) ]        (1)

where

W denotes a character string composed of n characters;

c_i denotes the i-th character of W (i = 1, 2, 3, ..., n);

f(c_1), f(c_2), f(c_3), ..., f(c_n) denote the frequencies of the characters c_1, c_2, c_3, ..., c_n in the corpus;

f(c_1 c_2 ... c_n) denotes the frequency with which the characters c_1, c_2, c_3, ..., c_n occur together as one string;

MI(W) denotes the mutual information between all the characters of the string W.

The left and right information entropy in step (3) above are computed with the following formulas:

The left information entropy is:

    EL(W) = - Σ_{a ∈ A} p(a | W) · log2 p(a | W)        (2)

The right information entropy is:

    ER(W) = - Σ_{b ∈ B} p(b | W) · log2 p(b | W)        (3)

where

W denotes a given character string composed of n characters;

p(a | W) and p(b | W) denote the conditional probabilities of the character a appearing immediately to the left of W and of the character b appearing immediately to the right of W, respectively;

A and B denote the sets of all words that appear to the left and to the right of W, respectively;

c_i denotes the i-th character of W, where i = 1, 2, 3, ..., n.

Defining the evaluation function of the character string W in step (4) above and using it to segment the corpus means using the mutual information and the left and right information entropy obtained in steps (2) and (3) to evaluate the credibility of the character string W as a word and to judge whether the string is a word. The evaluation function of the string W is computed as:

    F(W) = λ · MI(W) + (1 - λ) · ( EL(W) + ER(W) )        (4)

where

W denotes a given character string composed of n characters;

MI(W) denotes the mutual information value between the characters of W;

EL(W) denotes the left information entropy of W;

ER(W) denotes the right information entropy of W;

λ is a balance factor used to adjust the relative weight of the information entropy and the mutual information value in the evaluation function of W.

In step (5) above, a domain term conditional random field model is trained with the conditional random field method, using the word itself, the part of speech and the frequency of occurrence of the word as the training features of the random field, and the model is used to recognize domain terms. The operation steps are as follows:

(51) Annotate the corpus with the word itself, the part of speech and the frequency of occurrence of the word;

(52) Train on the annotated feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters, which constitute the conditional random field model for term recognition in the field;

(53) Use the conditional random field model for domain term recognition to recognize the domain terms in the annotated test feature sequences.

Compared with the prior art, the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention has the following effects:

(1) The method organically combines the two classes of term recognition methods, statistical and machine-learning, and effectively solves the data sparseness problem that arises when statistical methods alone are used for term recognition;

(2) The method uses the mutual information algorithm to segment and annotate the corpus, so the corpus is annotated automatically;

(3) The method uses only the three most common word features for training the conditional random field, which gives it strong cross-domain generality, effectively reduces the computational load of the conditional random field, and shortens the training time of the conditional random field.

Brief Description of the Drawings

Fig. 1 is a flowchart of the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention;

Fig. 2 is a flowchart of step (4) in Fig. 1;

Fig. 3 is a flowchart of step (5) in Fig. 1.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

This embodiment uses the recognition of domain terms about the plant bamboo as an example to illustrate the present invention, but it is not intended to limit the scope of the present invention.

Referring to Fig. 1, the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention includes the following steps:

(1) Collect domain text corpus and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus.

For example, this embodiment selects the electronic manuscript of Volume 9 (the subfamily Bambusoideae, 竹亚科) of the Flora of China (《中国植物志》) as the domain text corpus.

First, the corpus is randomly divided in a 4:1 ratio into two parts: a training corpus and a test corpus;

Then, all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus are retrieved, and the symbol "//" is inserted before and after each of these characters;

Finally, with reference to the Chinese part-of-speech table, the symbol "//" is likewise inserted before and after all pronouns, interjections, auxiliary words and function words, as well as all words whose first character is one of 和, 有, 的, 得, 将, 把, 从, 了, 是, 则, 在, 每, 这, 该, 给, 所, 使, 为, 不, 着, 很, 与.
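As a rough illustration of this preprocessing, the following Python sketch (an assumed implementation, not the code of the patent) wraps every non-Chinese character and every character from a stop list such as the one above with the "//" marker:

# Hypothetical stop list built from the function-word characters listed above.
STOP_CHARS = set("和有的得将把从了是则在每这该给所使为不着很与")

def mark_corpus(text: str) -> str:
    """Insert '//' before and after every non-Chinese character (punctuation,
    spaces, digits, ASCII, ...) and every stop character, as in step (1)."""
    out = []
    for ch in text:
        is_chinese = '\u4e00' <= ch <= '\u9fff'
        if not is_chinese or ch in STOP_CHARS:
            out.append(f"//{ch}//")
        else:
            out.append(ch)
    return "".join(out)

print(mark_corpus("边缘被流苏状毛，"))   # -> 边缘被流苏状毛//，//

Note that the text above marks whole words whose first character is a stop character; the character-level check in this sketch is only a simplification.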

(2) Set a character string W and compute its mutual information value with the following formula:

Assume that a domain term consists of n characters. If the character string W is a domain term, then W is composed of the characters c_1, c_2, c_3, ..., c_n, and the mutual information value of W is computed as:

    MI(W) = log2 [ f(c_1 c_2 ... c_n) / ( f(c_1) · f(c_2) · ... · f(c_n) ) ]        (1)

where

W denotes a character string composed of n characters;

c_i denotes the i-th character of W, where i = 1, 2, 3, ..., n;

f(c_1), f(c_2), f(c_3), ..., f(c_n) denote the frequencies of the characters c_1, c_2, c_3, ..., c_n in the corpus;

f(c_1 c_2 ... c_n) denotes the frequency with which the characters c_1, c_2, c_3, ..., c_n occur together as one string;

MI(W) denotes the mutual information between all the characters of the string W.

Since the present invention assumes that a Chinese domain term is at most 4 characters long and that punctuation marks, spaces, digits, ASCII characters, other non-Chinese characters, interjections, function words, demonstrative pronouns and similar words cannot occur inside a Chinese domain term, the 2-word, 3-word and 4-word mutual information values are computed for every character position of the corpus, and the computation stops whenever the marker "//" is encountered. The mutual information value is computed with formula (1) given in step (2) of the Summary of the Invention above.

For example, for the corpus fragment "边缘被流苏状毛//，//", the 2-word strings are 边缘, 缘被, 被流, 流苏, 苏状 and 状毛; the 3-word strings are 边缘被, 缘被流, 被流苏, 流苏状 and 苏状毛; the 4-word strings are 边缘被流, 缘被流苏, 被流苏状 and 流苏状毛. The mutual information value of each of these strings is then computed with formula (1).
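The following Python sketch (an illustrative assumption, not the exact implementation of the embodiment) counts character and n-gram frequencies on the "//"-marked corpus and evaluates formula (1) for 2-, 3- and 4-character strings:

import math
from collections import Counter

def segments(marked_text: str):
    # Runs of plain text between consecutive '//' markers.
    return [seg for seg in marked_text.split("//") if seg]

def build_counts(marked_text: str, max_n: int = 4):
    """Count single characters and all n-grams (2 <= n <= max_n) inside each run."""
    char_freq, ngram_freq = Counter(), Counter()
    for seg in segments(marked_text):
        char_freq.update(seg)
        for n in range(2, max_n + 1):
            for i in range(len(seg) - n + 1):
                ngram_freq[seg[i:i + n]] += 1
    return char_freq, ngram_freq

def mutual_information(w: str, char_freq: Counter, ngram_freq: Counter) -> float:
    """MI(W) = log2( f(c1..cn) / (f(c1) * f(c2) * ... * f(cn)) ), cf. formula (1)."""
    joint = ngram_freq[w]
    if joint == 0:
        return float("-inf")
    denom = 1
    for c in w:
        denom *= char_freq[c]
    return math.log2(joint / denom)

corpus = "边缘被流苏状毛//，//"
char_freq, ngram_freq = build_counts(corpus)
print(mutual_information("流苏", char_freq, ngram_freq))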

(3) Compute the left and right information entropy of the character string W with the following formulas:

The left information entropy is:

    EL(W) = - Σ_{a ∈ A} p(a | W) · log2 p(a | W)        (2)

The right information entropy is:

    ER(W) = - Σ_{b ∈ B} p(b | W) · log2 p(b | W)        (3)

where

W denotes a given character string composed of n characters;

p(a | W) and p(b | W) denote the conditional probabilities of the character a appearing immediately to the left of W and of the character b appearing immediately to the right of W, respectively;

A and B denote the sets of all words that appear to the left and to the right of W, respectively;

c_i denotes the i-th character of W, where i = 1, 2, 3, ..., n.

Whether a character string is a word depends not only on how tightly the characters inside the string are bound together, that is, on the mutual information between the characters, but also on the degree of freedom at the string boundaries: the more kinds of adjacent characters appear at the boundaries of the string, the larger its left and right information entropy, and hence the greater the freedom of its boundaries. The left and right information entropy are computed with formulas (2) and (3) of step (3) in the Summary of the Invention above.

For example, for the corpus fragment "边缘被流苏状毛//，//", the left information entropy of each candidate string is computed with formula (2) and the right information entropy with formula (3) in the same way.
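The boundary entropies of formulas (2) and (3) can be sketched in the same hypothetical setting by collecting the characters adjacent to each candidate string (the segments helper is the same as in the previous sketch):

import math
from collections import Counter

def segments(marked_text: str):
    # Runs of plain text between consecutive '//' markers (as in the previous sketch).
    return [seg for seg in marked_text.split("//") if seg]

def boundary_entropy(w: str, marked_text: str):
    """Return (EL(W), ER(W)): the entropy of the characters found immediately
    to the left and to the right of w in the corpus, cf. formulas (2) and (3)."""
    left, right = Counter(), Counter()
    for seg in segments(marked_text):
        start = seg.find(w)
        while start != -1:
            if start > 0:
                left[seg[start - 1]] += 1
            end = start + len(w)
            if end < len(seg):
                right[seg[end]] += 1
            start = seg.find(w, start + 1)

    def entropy(counts: Counter) -> float:
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    return entropy(left), entropy(right)

print(boundary_entropy("流苏", "边缘被流苏状毛//，//"))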

(4) Define the evaluation function of the character string W, set the evaluation function threshold, and compute the evaluation function value of each string to determine whether W is a word; successively compare the evaluation function value of the candidate string ending at the preceding character with that of the candidate string ending at the following character, obtain the corresponding ratio for each string, compare the ratio with the evaluation function threshold, and segment the corpus into word strings one by one. The operation steps are as follows:

(41) Define the evaluation function of the character string W; its expression is:

    F(W) = λ · MI(W) + (1 - λ) · ( EL(W) + ER(W) )        (4)

where

W denotes a given character string composed of n characters;

MI(W) denotes the mutual information value between the characters of W;

EL(W) denotes the left information entropy of W;

ER(W) denotes the right information entropy of W;

λ is a balance factor used to adjust the relative weight of the information entropy and the mutual information value in the evaluation function.

(42) Compute the evaluation function value of each candidate string and determine whether the string W is a word.

The evaluation function values of all candidate strings are computed with the evaluation function formula of step (4) in the Summary of the Invention, with λ set to 0.5; when the evaluation function value F(W) is greater than the threshold 0.8, the string W is regarded as a word.

For example, for the corpus fragment "边缘被流苏状毛//，//", the evaluation function value of each candidate string is computed in this way.
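Putting the two measures together, the evaluation function of formula (4) can be sketched as follows; the weighted-sum form and the settings λ = 0.5 and threshold 0.8 follow the description above, and the mutual_information and boundary_entropy helpers are the ones from the earlier sketches:

def evaluate(w: str, char_freq, ngram_freq, marked_text: str, lam: float = 0.5) -> float:
    """F(W) = lam * MI(W) + (1 - lam) * (EL(W) + ER(W)), cf. formula (4)."""
    mi = mutual_information(w, char_freq, ngram_freq)
    el, er = boundary_entropy(w, marked_text)
    return lam * mi + (1 - lam) * (el + er)

def is_word(w: str, char_freq, ngram_freq, marked_text: str, threshold: float = 0.8) -> bool:
    """A candidate string is treated as a word when its evaluation value exceeds the threshold."""
    return evaluate(w, char_freq, ngram_freq, marked_text) > threshold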

(43) Successively compare the evaluation function value of the candidate string ending at the preceding character with that of the candidate string ending at the following character, obtain the corresponding ratio for each string, compare the ratio with the evaluation function threshold, and segment the corpus into word strings one by one.

For example, the process starts from the first character of the corpus and selects substrings of length 4, 3, 2 and 1, denoted W4, W3, W2 and W1 respectively.

The evaluation function values of W4 and W3 are then compared. If the ratio of the evaluation value of W4 to that of W3 is not smaller than the threshold, W4 is regarded as a new word and is marked with the symbol "*" before and after it; otherwise W4 is not a new word, its last character is discarded, and the evaluation values of W3 and W2 are compared. If the ratio for W3 is not smaller than the threshold, W3 is regarded as a new word and is marked with "*" before and after it; otherwise W3 is not a new word, its last character is discarded, and the evaluation values of W2 and W1 are compared. If the ratio for W2 is not smaller than the threshold, W2 is regarded as a new word and is marked with "*" before and after it; otherwise W1 is regarded as a new word and is marked with "*" before and after it. As soon as a new word has been marked, the process restarts from the first character after that word: substrings of length 4, 3, 2 and 1 are again selected as W4, W3, W2 and W1 and the evaluation function comparison is repeated, skipping the "//" markers whenever they are encountered. This is repeated until the whole corpus has been processed.

For example, for the corpus fragment "边缘被流苏状毛//，//": first, substrings of length 4, 3, 2 and 1 are taken from the first character, namely 边缘被流, 边缘被, 边缘 and 边. It is first judged whether the value for 边缘被流 is at least 0.8; according to the evaluation function results of step (41) it is smaller than 0.8, so 边缘被流 is not a new word. It is then judged whether the value for 边缘被 is at least 0.8; it is again smaller than 0.8, so 边缘被 is not a new word either. Next it is judged whether the value for 边缘 is at least 0.8; it is greater than 0.8, so 边缘 is a new word. Once a new word has been recognized, substrings of length 4, 3, 2 and 1 are selected again starting from the first character after the new word, namely 被流苏状, 被流苏, 被流 and 被, and the comparison above is repeated, skipping the "//" markers, until the end of the corpus is reached. For the corpus fragment "边缘被流苏状毛//，//", the final segmentation result is "*边缘*被*流苏状*毛//，//".

(5) Using the word itself, the part of speech and the frequency of occurrence of the word as the training features of the random field, train a domain term conditional random field model with the conditional random field method and use the model to recognize domain terms. The operation steps are as follows:

(51) Annotate the corpus with the word itself, the part of speech and the frequency of occurrence of the word, as follows:

The segmented word strings W are annotated with feature sequences in turn. The annotated features of each word are: the word itself; the part of speech of the word; and the frequency of occurrence of the word. The K-Means clustering method is used to divide the word frequencies into 10 levels, each level forming one class, the classes being denoted A, B, C, D, E, F, G, H, I, J and K. The annotated feature sequences are divided into two parts: training feature sequences and test feature sequences;
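A minimal sketch of this frequency-binning step, assuming scikit-learn's KMeans as the clustering implementation (the embodiment names only the K-Means method, not a particular library):

from collections import Counter
from sklearn.cluster import KMeans

def frequency_classes(words, n_classes: int = 10):
    """Cluster the word frequencies with K-Means and map every word to a class letter."""
    freq = Counter(words)
    vocab = sorted(freq)
    X = [[freq[w]] for w in vocab]                     # one-dimensional feature: the raw frequency
    labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(X)
    letters = "ABCDEFGHIJK"
    return {w: letters[label] for w, label in zip(vocab, labels)}

print(frequency_classes(["边缘", "被", "流苏状", "毛", "边缘", "毛", "毛"], n_classes=3))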

(52) Train on the annotated training feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;

(53) Use the conditional random field model for domain term recognition to recognize the domain terms in the annotated test feature sequences, as follows:

The annotated test feature sequences are input into the conditional random field model for domain term recognition obtained after the training of step (52); the model computes the feature values and recognizes the domain terms, and the output is the recognized domain terms. For example, for the corpus fragment "边缘被流苏状毛//，//", 边缘 and 流苏状 are finally recognized as domain terms.
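For reference, a typical CRF++ 0.53 workflow corresponding to steps (51) to (53) can be sketched from Python as below; the feature template and the file names are assumptions for illustration (three feature columns, namely the word, the part of speech and the frequency class, plus a term/non-term label), not the template disclosed in the patent:

import subprocess

# Assumed CRF++ feature template: unigram features over the word, part-of-speech and
# frequency-class columns in a small context window, plus a bigram output feature.
TEMPLATE = """U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[0,2]
B
"""

with open("template", "w", encoding="utf-8") as f:
    f.write(TEMPLATE)

# train.data / test.data: one token per line with space-separated columns
# "word POS freq_class label"; sentences are separated by blank lines.
subprocess.run(["crf_learn", "template", "train.data", "model"], check=True)   # step (52): training
subprocess.run(["crf_test", "-m", "model", "test.data"], check=True)           # step (53): recognition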

The above is the preferred embodiment of the present invention. On the basis of the disclosure of the present invention, those skilled in the art can obviously conceive of similar or alternative schemes, all of which shall fall within the scope of the technical innovation of the present invention.

Claims (5)

1. A Chinese domain term recognition method based on mutual information and a conditional random field model, the specific steps being as follows:
(1) collecting domain text corpus and marking all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus;
(2) setting a character string W and computing the mutual information value of the character string W;
(3) computing the left and right information entropy of the character string W;
(4) defining an evaluation function of the character string W, setting an evaluation function threshold, computing the evaluation function value of each character string to determine whether the character string W is a word, successively comparing the evaluation function value of the candidate string ending at the preceding character with that of the candidate string ending at the following character, obtaining the corresponding ratio for each string, comparing the ratio with the evaluation function threshold, and segmenting the corpus into word strings one by one;
(5) using the word itself, the part of speech and the frequency of occurrence of the word as the training features of the random field, training a domain term conditional random field model with the conditional random field method, and using the model to recognize domain terms.

2. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that setting the character string W and computing its mutual information value in step (2) uses the following formula: assume that a domain term consists of n characters; if the character string W is a domain term, then W is composed of the characters c_1, c_2, c_3, ..., c_n, and the mutual information value of W is computed as

    MI(W) = log2 [ f(c_1 c_2 ... c_n) / ( f(c_1) · f(c_2) · ... · f(c_n) ) ]        (1)

wherein W denotes a character string composed of n characters; c_i denotes the i-th character of W (i = 1, 2, 3, ..., n); f(c_1), f(c_2), f(c_3), ..., f(c_n) denote the frequencies of the characters c_1, c_2, c_3, ..., c_n in the corpus; f(c_1 c_2 ... c_n) denotes the frequency with which the characters c_1, c_2, ..., c_n occur together as one string; and MI(W) denotes the mutual information between all the characters of the string W.

3. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that the left and right information entropy in step (3) are computed with the following formulas:

    the left information entropy is  EL(W) = - Σ_{a ∈ A} p(a | W) · log2 p(a | W)        (2)

    the right information entropy is  ER(W) = - Σ_{b ∈ B} p(b | W) · log2 p(b | W)        (3)

wherein W denotes a given character string composed of n characters; p(a | W) and p(b | W) denote the conditional probabilities of the character a appearing to the left of W and of the character b appearing to the right of W, respectively; A and B denote the sets of all words appearing to the left and to the right of W; and c_i denotes the i-th character of W, where i = 1, 2, 3, ..., n.

4. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that defining the evaluation function of the character string W in step (4) and using the evaluation function to segment the corpus means using the mutual information and the left and right information entropy computed in steps (2) and (3) to evaluate the credibility of the character string W as a word and to judge whether the string is a word, wherein the evaluation function of the string W is computed as

    F(W) = λ · MI(W) + (1 - λ) · ( EL(W) + ER(W) )        (4)

wherein W denotes a given character string composed of n characters; MI(W) denotes the mutual information value between the characters of W; EL(W) denotes the left information entropy of W; ER(W) denotes the right information entropy of W; and λ is a balance factor used to adjust the relative weight of the information entropy and the mutual information value in the evaluation function of W.

5. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that, in step (5), training a domain term conditional random field model with the conditional random field method, using the word itself, the part of speech and the frequency of occurrence of the word as training features, and using the model to recognize domain terms comprises the following steps:
(51) annotating the corpus with the word itself, the part of speech and the frequency of occurrence of the word;
(52) training on the annotated feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters, the conditional random field parameters being the conditional random field model for term recognition in the field;
(53) using the conditional random field model for domain term recognition to recognize the domain terms in the annotated test feature sequences.
CN201210528734.8A 2012-12-11 2012-12-11 Chinese domain term recognition method based on mutual information and conditional random field model Expired - Fee Related CN103049501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210528734.8A CN103049501B (en) 2012-12-11 2012-12-11 Chinese domain term recognition method based on mutual information and conditional random field model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210528734.8A CN103049501B (en) 2012-12-11 2012-12-11 Chinese domain term recognition method based on mutual information and conditional random field model

Publications (2)

Publication Number Publication Date
CN103049501A true CN103049501A (en) 2013-04-17
CN103049501B CN103049501B (en) 2016-08-03

Family

ID=48062142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210528734.8A Expired - Fee Related CN103049501B (en) Chinese domain term recognition method based on mutual information and conditional random field model

Country Status (1)

Country Link
CN (1) CN103049501B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202043B (en) * 2016-05-20 2019-04-12 北京理工大学 An immune genetic method for new word identification based on a word-formation rate fitness function

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100088353A1 (en) * 2006-10-17 2010-04-08 Samsung Sds Co., Ltd. Migration Apparatus Which Convert Database of Mainframe System into Database of Open System and Method for Thereof
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayesian Word Sense Disambiguation Method Based on Information Gain
CN102314507A (en) * 2011-09-08 2012-01-11 北京航空航天大学 Recognition ambiguity resolution method of Chinese named entity

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周浪 et al.: "A phrase filtering technique for term extraction", Computer Engineering and Applications, no. 19, 31 December 2009 (2009-12-31), pages 9-11 *
贾美英 et al.: "Research on automatic extraction of military intelligence terminology using CRF", Computer Engineering and Applications, no. 32, 31 December 2009 (2009-12-31), pages 126-129 *
赵秦怡 et al.: "A string-scanning Chinese text word segmentation method based on mutual information", Journal of Intelligence, vol. 29, no. 7, 31 July 2010 (2010-07-31), pages 152-172 *

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593427A (en) * 2013-11-07 2014-02-19 清华大学 New word searching method and system
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method
CN103778243B (en) * 2014-02-11 2017-02-08 北京信息科技大学 Domain term extraction method
CN103902673A (en) * 2014-03-19 2014-07-02 新浪网技术(中国)有限公司 Anti-garbage-filtering rule upgrading method and device
CN103902673B (en) * 2014-03-19 2017-11-24 新浪网技术(中国)有限公司 Anti-spam filtering rule upgrade method and device
CN104572621A (en) * 2015-01-05 2015-04-29 语联网(武汉)信息技术有限公司 Decision tree based term judgment method
CN104572621B (en) * 2015-01-05 2018-01-26 语联网(武汉)信息技术有限公司 A kind of term decision method based on decision tree
CN104679885A (en) * 2015-03-17 2015-06-03 北京理工大学 User search string organization name recognition method based on semantic feature model
WO2016179988A1 (en) * 2015-05-12 2016-11-17 深圳市华傲数据技术有限公司 Chinese address parsing and annotation method
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN105389349B (en) * 2015-10-27 2018-07-27 上海智臻智能网络科技股份有限公司 Dictionary update method and device
CN108875040A (en) * 2015-10-27 2018-11-23 上海智臻智能网络科技股份有限公司 Dictionary update method and computer readable storage medium
CN105224682B (en) * 2015-10-27 2018-06-05 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN108897842A (en) * 2015-10-27 2018-11-27 上海智臻智能网络科技股份有限公司 Computer readable storage medium and computer system
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105183923A (en) * 2015-10-27 2015-12-23 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN108897842B (en) * 2015-10-27 2021-04-09 上海智臻智能网络科技股份有限公司 Computer readable storage medium and computer system
CN108875040B (en) * 2015-10-27 2020-08-18 上海智臻智能网络科技股份有限公司 Dictionary updating method and computer-readable storage medium
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN106021230A (en) * 2016-05-19 2016-10-12 无线生活(杭州)信息科技有限公司 Word segmentation method and word segmentation apparatus
CN106021230B (en) * 2016-05-19 2018-11-23 无线生活(杭州)信息科技有限公司 A kind of segmenting method and device
CN107423278A (en) * 2016-05-23 2017-12-01 株式会社理光 Evaluation element recognition method, apparatus and system
CN107423278B (en) * 2016-05-23 2020-07-14 株式会社理光 Evaluation element identification method, device and system
CN106095753A (en) * 2016-06-07 2016-11-09 大连理工大学 A financial domain term recognition method based on information entropy and term credibility
CN106095753B (en) * 2016-06-07 2018-11-06 大连理工大学 A financial domain term recognition method based on information entropy and term confidence level
CN106202056A (en) * 2016-07-26 2016-12-07 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106202056B (en) * 2016-07-26 2019-01-04 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106445921A (en) * 2016-09-29 2017-02-22 北京理工大学 Chinese text term extracting method utilizing quadratic mutual information
CN106445921B (en) * 2016-09-29 2019-05-07 北京理工大学 Chinese text term extraction method using quadratic mutual information
CN106649661A (en) * 2016-12-13 2017-05-10 税云网络科技服务有限公司 Method and device for establishing knowledge base
CN108268440A (en) * 2017-01-04 2018-07-10 普天信息技术有限公司 A kind of unknown word identification method
CN106991085B (en) * 2017-04-01 2020-08-04 中国工商银行股份有限公司 Entity abbreviation generation method and device
CN106991085A (en) * 2017-04-01 2017-07-28 中国工商银行股份有限公司 Entity abbreviation generation method and device
CN107291692B (en) * 2017-06-14 2020-12-18 北京百度网讯科技有限公司 Artificial intelligence-based word segmentation model customization method, device, equipment and medium
CN107291692A (en) * 2017-06-14 2017-10-24 北京百度网讯科技有限公司 Word segmentation model customization method, device, equipment and medium based on artificial intelligence
CN109145282A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Sentence-breaking model training method, sentence-breaking method, apparatus and computer equipment
CN109145282B (en) * 2017-06-16 2023-11-07 贵州小爱机器人科技有限公司 Sentence-breaking model training method, sentence-breaking device and computer equipment
CN107391486A (en) * 2017-07-20 2017-11-24 南京云问网络技术有限公司 A domain new word identification method based on statistical information and sequence labelling
CN108509425B (en) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 A Novelty-based Chinese New Word Discovery Method
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN108804413A (en) * 2018-04-28 2018-11-13 百度在线网络技术(北京)有限公司 Text cheating recognition method and device
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A text segmentation method for judgment documents based on PageRank and information entropy
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building
CN109710947A (en) * 2019-01-22 2019-05-03 福建亿榕信息技术有限公司 Method and device for generating electric power professional thesaurus
CN109710947B (en) * 2019-01-22 2021-09-07 福建亿榕信息技术有限公司 Method and device for generating electric power professional thesaurus
CN110175331A (en) * 2019-05-29 2019-08-27 三角兽(北京)科技有限公司 Technical term recognition method, device, electronic equipment and readable storage medium
CN111090742A (en) * 2019-12-19 2020-05-01 东软集团股份有限公司 Question and answer pair evaluation method and device, storage medium and equipment
CN111090742B (en) * 2019-12-19 2024-05-17 东软集团股份有限公司 Question-answer pair evaluation method, question-answer pair evaluation device, storage medium and equipment
CN115495507B (en) * 2022-11-17 2023-03-24 江苏鸿程大数据技术与应用研究院有限公司 Engineering material information price matching method, system and storage medium
CN115495507A (en) * 2022-11-17 2022-12-20 江苏鸿程大数据技术与应用研究院有限公司 Engineering material information price matching method, system and storage medium
CN116702786A (en) * 2023-08-04 2023-09-05 山东大学 Chinese professional term extraction method and system integrating rules and statistical features
CN116702786B (en) * 2023-08-04 2023-11-17 山东大学 Chinese professional term extraction method and system integrating rules and statistical features

Also Published As

Publication number Publication date
CN103049501B (en) 2016-08-03

Similar Documents

Publication Publication Date Title
CN103049501B (en) Chinese domain term recognition method based on mutual information and conditional random field model
Li et al. Twiner: named entity recognition in targeted twitter stream
CN107526799B (en) A Deep Learning-Based Knowledge Graph Construction Method
CN106997382B (en) Automatic labeling method and system for innovative creative labels based on big data
CN107133213B (en) A method and system for automatic extraction of text summaries based on algorithm
WO2021068339A1 (en) Text classification method and device, and computer readable storage medium
CN104572892B (en) A Text Classification Method Based on Recurrent Convolutional Network
CN102591988B (en) Short text classification method based on semantic graphs
WO2017167067A1 (en) Method and device for webpage text classification, method and device for webpage text recognition
CN106095753B (en) A financial domain term recognition method based on information entropy and term confidence level
CN108710611B (en) A short text topic model generation method based on word network and word vector
CN110598203A (en) A method and device for extracting entity information of military scenario documents combined with dictionaries
CN106845358B (en) Method and system for feature recognition of handwritten character images
CN106372061A (en) Short text similarity calculation method based on semantics
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN102567308A (en) Information processing feature extracting method
CN110705292B (en) Entity name extraction method based on knowledge base and deep learning
CN110347701B (en) A Target Type Identification Method for Entity Retrieval Query
CN108376133A (en) Short text sentiment classification method based on emotion word expansion
CN106227756A (en) A stock index forecasting method and system based on sentiment classification
CN102737112B (en) Concept Relevance Calculation Method Based on Representational Semantic Analysis
CN105868347A (en) Tautonym disambiguation method based on multistep clustering
CN108038099A (en) Low frequency keyword recognition method based on term clustering
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160803

Termination date: 20181211