CN103049501A

CN103049501A - Chinese domain term recognition method based on mutual information and conditional random field model

Info

Publication number: CN103049501A
Application number: CN2012105287348A
Authority: CN
Inventors: 彭琳; 刘宗田; 杨林楠; 张立敏
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2012-12-11
Filing date: 2012-12-11
Publication date: 2013-04-17
Anticipated expiration: 2032-12-11
Also published as: CN103049501B

Abstract

The invention discloses a Chinese domain term recognition method based on mutual information and a conditional random field model. The Chinese domain term recognition method includes the following steps: (1) gathering domain text corpus and marking all the punctuations, spaces, numbers, ASSCII (American Standard Code for Information Interchange) characters and characters except Chinese characters in the corpus; (2) setting character strings and computing the mutual information values of the character strings, (3) computing the left comentropy and the right comentropy of every character string, (4) defining character string evaluation function, setting evaluation function threshold, computing the evaluation function values of every character string, determining that every character string is a word, comparing in sequence the evaluation function value of the former character with the evaluation function value of the latter character in the character string and segmenting character meaning character strings one by one, (5) utilizing conditional random fields to train a conditional random field model and recognizing domain terms with the conditional random field model. When the Chinese domain term recognition method is used to recognize terms, the data sparsity of legitimate terms is overcome, the amount of calculation of conditional random fields is reduced, and the accuracy of the Chinese domain term recognition is improved.

Description

Chinese domain term recognition method based on mutual information and conditional random field model

技术领域 technical field

本发明涉及的是一种基于互信息和条件随机场模型的中文领域术语识别方法，属于信息技术领域。 The invention relates to a method for recognizing Chinese domain terms based on mutual information and a conditional random field model, which belongs to the field of information technology.

背景技术 Background technique

国家标准GB/T15237.1-2000《术语工作词汇》的定义，术语是特定专业领域中一般概念的词语指称，是在一个学科领域内使用、表示该学科领域内的概念或关系的词或词组。术语可以分为日常生活中使用的一般性术语和特定领域中使用的领域术语。一般性术语多是按人们的生活和工作习惯形成的，不要求它在概念的表达上严格准确，其含义往往比较模糊；领域术语是对一个专业概念的系统性、概括性的描述，不允许模棱两可，每一个专业术语表达的概念都必须准确无误，不能因使用人的不同而不同。 The definition of the national standard GB/T15237.1-2000 "Terminology Working Vocabulary" is that a term refers to a term referring to a general concept in a specific professional field, and is a word or phrase used in a subject area to express a concept or relationship in the subject area . Terminology can be divided into general terms used in daily life and domain terms used in specific fields. General terms are mostly formed according to people's living and working habits, and they are not required to be strictly accurate in the expression of concepts, and their meanings are often vague; field terms are systematic and general descriptions of a professional concept, and are not allowed Ambiguity, the concept expressed by each technical term must be accurate and cannot be different due to different users.

领域术语识别是指从特定的科学或技术领域的语料库中抽出专业领域术语。领域术语自动识别作为信息抽取的重要内容,在自然语言处理领域有着广泛的应用，对于提高领域文本索引与检索、文本挖掘、本体构建、文本分类和聚类、潜在语义分析等的处理精度有着重要的意义。现有的中文文本信息中的领域术语识别方法主要有： Domain term recognition refers to the extraction of professional domain terms from a corpus in a specific scientific or technical field. As an important content of information extraction, automatic recognition of domain terms is widely used in the field of natural language processing. It is important for improving the processing accuracy of domain text indexing and retrieval, text mining, ontology construction, text classification and clustering, and latent semantic analysis. meaning. The existing domain term recognition methods in Chinese text information mainly include:

（1）基于统计方法的中文领域术语识别方法，主要思想是利用领域术语内部各组成成分之间较高的关联程度以及术语的领域特征信息来抽取领域术语。基于统计的方法一般流程是: 首先利用统计学或信息论中的方法,建立起各种统计信息，并根据统计结果,确定比较准确的种子词；然后在此基础上不断扩展,获取最终的领域术语。词语频率、均值和方差是比较常用的统计方法,更多的学者使用假设检验的方法, 主要有T检验、卡方检验、对数似然比、点互信息等。用统计方法识别领域术语,不需要句法、语义上的信息,不局限于某一专门领域,也不依赖任何资源,通用性较强。 (1) A method for identifying Chinese domain terms based on statistical methods. The main idea is to extract domain terms by using the higher degree of correlation between the components of the domain term and the domain characteristic information of the term. The general process of the method based on statistics is as follows: First, use the methods in statistics or information theory to establish various statistical information, and determine the more accurate seed words according to the statistical results; then continue to expand on this basis to obtain the final domain terms . Word frequency, mean and variance are relatively commonly used statistical methods, and more scholars use hypothesis testing methods, mainly including T test, chi-square test, log likelihood ratio, point mutual information, etc. Using statistical methods to identify domain terms does not require syntactic and semantic information, is not limited to a specific field, does not depend on any resources, and has strong versatility.

其中，基于统计的互信息算法应用最为广泛。例如有文章报道，其题目为“基于互信息的中文术语抽取系统”（该文作者是：张锋许云侯艳樊孝忠，发表于2005年出版的《计算机应用研究》第22卷第5期第72-73，77页），该文公开了一种中文术语自动抽取系统，该系统首先基于互信息计算字串的内部结合强度，从而得到术语候选集；接着从术语候选集中去除基本词，并利用普通词语搭配前缀、后缀信息进一步过滤；最后对术语候选进行词法分析，利用术语的词性构成规则进行判别，得到最终的术语抽取结果。实验结果表明，利用互信息算法对术语抽取的准确率为72.19％，召回率为77.98％，F测量值为74.97％。例如有文献报道，“C值和互信息相结合的术语抽取”（作者是：梁颖红张文静张有承，发表于2010年出版的《计算机应用与软件》第27卷第4期第108-110页），该文公开了一种将C值和互信息相结合的术语抽取方法，该方法提出综合C-value参数在长术语抽取方面具有优势，实验结果表明，该方法对长术语抽取的准确率为75.7％，召回率为68.4％，F测量值为71.9％，高于相同语料下的其他方法。但是该算法性能直接依赖于语料库的规模和候选领域术语的词频,针对有些低频率候选术语也可能是合法术语的数据稀疏问题难以解决，所以单纯利用互信息算法对领域术语进行识别，识别的准确率、召回率以及F测量值均难以达到80%以上，很难获得理想的识别效果； Among them, the mutual information algorithm based on statistics is the most widely used. For example, there is an article report titled "Chinese Term Extraction System Based on Mutual Information" (the author of this article is: Zhang Feng, Xu Yun, Hou Yan, Fan Xiaozhong, published in "Computer Application Research", Volume 22, Issue 5, Issue 2005 72-73, p. 77), which discloses an automatic Chinese term extraction system. The system first calculates the internal combination strength of word strings based on mutual information to obtain a term candidate set; then removes basic words from the term candidate set, and Use common words with prefix and suffix information to further filter; finally, perform lexical analysis on the term candidates, use the part-of-speech composition rules of the terms to judge, and obtain the final term extraction results. The experimental results show that the accuracy rate of term extraction using mutual information algorithm is 72.19%, the recall rate is 77.98%, and the F measurement value is 74.97%. For example, there is a literature report, "Term Extraction Combining C Value and Mutual Information" (authors are: Liang Yinghong, Zhang Wenjing, Zhang Youcheng, published in "Computer Application and Software", Volume 27, Issue 4, pp. 108-110, published in 2010), This paper discloses a term extraction method that combines C-value and mutual information. This method proposes that the comprehensive C-value parameter has advantages in long term extraction. Experimental results show that the accuracy of this method for long term extraction is 75.7% %, the recall rate is 68.4%, and the F-measure value is 71.9%, which are higher than other methods under the same corpus. However, the performance of this algorithm directly depends on the size of the corpus and the word frequency of candidate domain terms. It is difficult to solve the problem of data sparseness that some low-frequency candidate terms may also be legal terms. The rate, recall rate and F measurement value are all difficult to reach more than 80%, and it is difficult to obtain the ideal recognition effect;

（2）基于机器学习的中文领域术语识别方法的主要步骤为: 采用手工或半自动方式构建训练语料, 根据某种机器学习算法对训练语料学习生成模型,然后再利用模型对测试语料进行领域术语抽取实验,以验证本算法的有效性。目前已用于中文领域术语识别的机器学习理论主要包括决策树、支持向量机、隐马尔科夫模型、最大熵模型、最大熵马尔科夫模型和条件随机场算法等。基于机器学习的术语识别方法无需专家的领域知识和语言知识, 实现可行性大, 在考虑多种术语特征的情况下可以得到较好的识别或抽取效果。 (2) The main steps of the machine learning-based Chinese domain term recognition method are: Construct the training corpus manually or semi-automatically, learn and generate a model from the training corpus according to a certain machine learning algorithm, and then use the model to extract domain terms from the test corpus Experiments are carried out to verify the effectiveness of this algorithm. The machine learning theories that have been used for term recognition in the Chinese field mainly include decision trees, support vector machines, hidden Markov models, maximum entropy models, maximum entropy Markov models, and conditional random field algorithms. The term recognition method based on machine learning does not require expert domain knowledge and language knowledge, and it is highly feasible to implement. It can obtain better recognition or extraction results when considering multiple term features.

目前，基于机器学习的中文领域术语识别方法中条件随机场模型应用最为广泛。例如有文献报道，“一种中医名词术语自动抽取方法”（作者是：张五辈白宇王裴岩张桂平，发表于2011年出版的《沈阳航空航天大学学报》第28卷第1期第72-75页），该文公开了一种针对中医领域的基于条件随机场的术语抽取方法，该方法将中医领域术语抽取看作一个序列标注问题，将中医领域术语分布的特征量化作为训练的特征，利用CRF工具包训练出一个领域术语模型，然后利用该模型进行术语抽取。选择《名医类案》作为中医领域文本进行术语抽取实验，准确率达到83.11％，召回率达到81.04％，F测量值达到82.06％。以及文章“采用CRF技术的军事情报术语自动抽取研究”（作者是：贾美英杨炳儒郑德权杨靖，发表于2009年出版的《计算机工程与应用》第45卷第32期第126-129页），该文公开了一种针对军事情报领域的基于条件随机场的术语抽取方法，该方法将领域术语识别看作一个序列标注问题，将领域术语分布的特征量化作为训练的特征，利用CRF工具包训练出一个领域术语特征模板，然后利用该模板进行领域术语抽取。实验表明，该方法对军事情报领域术语的识别结果良好，准确率可达到73.24％，召回率达到69.57％，F测量值达到71.36％。 At present, the conditional random field model is the most widely used in the recognition method of Chinese domain terms based on machine learning. For example, there is a literature report, "A method for automatic extraction of terminology in traditional Chinese medicine" (authors are: Zhang Wudai, Bai Yu, Wang Peiyan, Zhang Guiping, published in "Journal of Shenyang Aerospace University", Volume 28, Issue 1, Issue 72, published in 2011 -75 pages), this paper discloses a term extraction method based on conditional random field for the field of traditional Chinese medicine. , use the CRF toolkit to train a domain term model, and then use the model for term extraction. Selecting "Famous Doctor Class Cases" as the text in the field of traditional Chinese medicine for term extraction experiments, the accuracy rate reached 83.11%, the recall rate reached 81.04%, and the F measurement value reached 82.06%. And the article "Research on Automatic Extraction of Military Intelligence Terminology Using CRF Technology" (Authors: Jia Meiying, Yang Bingru, Zheng Dequan, Yang Jing, published in "Computer Engineering and Application", Volume 45, Issue 32, Pages 126-129, published in 2009), This paper discloses a conditional random field-based term extraction method for the military intelligence field. This method regards domain term recognition as a sequence labeling problem, quantifies the feature distribution of domain terms as the training feature, and uses the CRF toolkit to train A domain term feature template is generated, and then the domain term extraction is performed using the template. Experiments show that the method has good recognition results for terms in the field of military intelligence, with an accuracy rate of 73.24%, a recall rate of 69.57%, and an F-measurement value of 71.36%.

利用条件随机场算法进行领域术语识别时，训练语料基本上都为手动和半自动标注的，人为参与度都高，工作量大，导致普遍识别量不大，制约了该算法的识别精度和应用。同时，需要先利用通用的分词工具对语料进行分词，然后再对分词后的语料进行条件随机场训练和测试，最终才能实现术语的识别。所以利用条件随机场算法进行领域术语识别的前提是，假设现有的通用分词工具可以对该领域的词汇进行准确地分词，并认为领域术语比分词工具所分的词粒度大。但是，由于专业领域术语与普通词汇存在差距，用一般性分词工具很难实现对专业领域语料的准确分词。因此，目前互信息和条件随机场方法在领域术语识别过程中自动识别程度较低，且识别精度不高。 When the conditional random field algorithm is used for domain term recognition, the training corpus is basically manually and semi-automatically annotated. The degree of human participation is high and the workload is heavy, resulting in a small amount of general recognition, which restricts the recognition accuracy and application of the algorithm. At the same time, it is necessary to use general-purpose word segmentation tools to segment the corpus, and then conduct conditional random field training and testing on the segmented corpus to finally realize term recognition. Therefore, the premise of using conditional random field algorithm to identify domain terms is assuming that the existing general word segmentation tools can accurately segment the vocabulary in the field, and it is believed that the domain term is larger than the word granularity segmented by the segmentation tool. However, due to the gap between professional field terminology and common vocabulary, it is difficult to achieve accurate word segmentation of professional field corpus with general word segmentation tools. Therefore, the current mutual information and conditional random field methods have a low degree of automatic recognition in the process of field term recognition, and the recognition accuracy is not high.

发明内容 Contents of the invention

鉴于以上所述现有技术存在的问题，本发明的目的是提供一种基于互信息和条件随机场模型的中文领域术语识别方法，该方法在术语识别时，不仅能克服合法术语的数据稀疏，降低了条件随机场算法的运算量，而且能够提高中文领域术语识别精度。 In view of the problems in the prior art described above, the purpose of the present invention is to provide a method for identifying Chinese domain terms based on mutual information and conditional random field models, which can not only overcome the data sparseness of legal terms when identifying terms, but also The computational load of the conditional random field algorithm is reduced, and the recognition accuracy of Chinese domain terms can be improved.

为了达到上述目的，本发明采用下述技术方案： In order to achieve the above object, the present invention adopts following technical scheme:

本发明的基于互信息和条件随机场模型的中文领域术语识别方法，具体步骤如下： The Chinese domain term recognition method based on mutual information and conditional random field model of the present invention, concrete steps are as follows:

（1）、收集领域文本语料，对语料中所有的标点符号、空格、数字、ASCII字符以及汉字以外字符进行标记； (1) Collect domain text corpus and mark all punctuation marks, spaces, numbers, ASCII characters and characters other than Chinese characters in the corpus;

（2）、设置字串

Figure 2012105287348100002DEST_PATH_IMAGE001

，计算字串

的互信息值； (2), set the string

, to calculate the string

mutual information value;

（3）、计算字串

左右信息熵； (3), calculate the string

left and right information entropy;

（4）、定义字串

评价函数，设置评价函数阈值，计算各字串的评价函数值，确定字串

为词，依次比较该字串

中前一字

Figure 2012105287348100002DEST_PATH_IMAGE003

的评价函数值与后一字

评价函数值相比较，得到各字串

中对应的比值，其比值再与评价函数阈值比较，逐一对字义字串

分词； (4), define the string

evaluation function, set the evaluation function Threshold, calculate the evaluation function value of each string, determine the string

is a word, compare the strings in turn

Chinese character

The value of the evaluation function and the next word

Compare the evaluation function values to get each string

The corresponding ratio in , and its ratio is then compared with the evaluation function Threshold comparison, one-by-one for literal strings

Participle;

（5）、以词、词性、词的出现频率的随机场的训练特征，利用条件随机场方法训练出一个领域术语条件随机场模型，用该模型对进行领域术语识别。 (5) Using the training characteristics of the random field of word, part of speech, and word frequency, use the conditional random field method to train a field term conditional random field model, and use this model to identify field terms.

上述步骤（2）中所述的（2）设置字串

，计算字串

的互信息值，其计算公式如下： (2) setting string described in step (2) above

, to calculate the string

The mutual information value of , its calculation formula is as follows:

假设一个领域术语是由n个字组成，如果字串

为一个领域术语，那么字串

由

、

、

……

个字组成，字串

的互信息值计算公式如下： Assuming that a domain term is composed of n words, if the string

is a domain term, then the string

Depend on

,

...

composed of words, string

The formula for calculating the mutual information value is as follows:

（1）

(1)

其中，

表示一个由n个字组成的字串； in,

Represents a string consisting of n characters;

表示组成字串

的第i个字（i=1，2，3，…，n）； Indicates the composition of the string

The ith word of (i=1, 2, 3, ..., n);

表示语料库中字

出现的频次；

Represents words in the corpus

frequency of occurrence;

表示语料库中字

出现的频次；

Represents words in the corpus

frequency of occurrence;

表示语料库中字出现的频次；

Represents words in the corpus frequency of occurrence;

表示语料库中字出现的频次； Represents words in the corpus frequency of occurrence;

表示字

、

、

、…、

同时出现的频次；

Indicates the word

,

,...,

frequency of simultaneous occurrences;

表示字串

中所有字与字之间的互信息。 Represents a string

Mutual information between all words in .

上述步骤（3）中所述的计算左右信息熵，其计算公式如下： The calculation formula for calculating the left and right information entropy described in the above step (3) is as follows:

左信息熵计算公式为：

（2） The left information entropy calculation formula is:

(2)

右信息熵计算公式为：

（3） The right information entropy calculation formula is:

(3)

其中，

表示为给定的一个由n个字组成的字串； in,

Represented as a given string consisting of n characters;

和分别表示出现在

左侧和右则时的条件概率；

and Respectively Appear in

Conditional probabilities for left and right time;

和

表示

左边和右边所有出现的词集合；

and

express

The set of all occurrences of words on the left and right;

表示组成字串

的第i个字，其中，i=1，2，3，…，n 。

Indicates the composition of the string

The i-th word of , where i=1, 2, 3,..., n.

上述步骤（4）中所述的定义字串W评价函数，并利用评价函数对语料进行分词，是指利用步骤（2）和步骤（3）计算得到的互信息和左右信息熵值，对语料中的字串

为词的可信度进行评价，判断该字串是否为词，其中，字串W评价函数计算公式如下： The definition of the word string W evaluation function described in the above step (4), and the use of the evaluation function to segment the corpus refers to the use of the mutual information and left and right information entropy values calculated in steps (2) and (3), and the corpus string in

Evaluate the credibility of the word, and judge whether the word string is a word, wherein, the word string W evaluation function calculation formula is as follows:

（4）

(4)

其中，

表示为给定的一个由n个字组成的字串； in,

Represented as a given string consisting of n characters;

表示字串

中字符之间的互信息值；

Represents a string

Mutual information value between characters in ;

表示字串

的左信息熵值；

Represents a string

The left information entropy value of ;

表示字串的右信息熵值；

Represents a string The right information entropy value of ;

为平衡因子，用以调节信息熵与互信息值在字串

评价函数中的权值。

is a balance factor, used to adjust the value of information entropy and mutual information in the string

Weights in the evaluation function.

上述步骤（5）中所述的以词、词性、词的出现频率的随机场的训练特征，利用条件随机场方法训练出一个领域术语条件随机场模型，利用该模型对进行领域术语识别，其操作步骤如下： In the above step (5), the training characteristics of the random field of word, part of speech, and word occurrence frequency are used to train a field term conditional random field model by using the conditional random field method, and the field term recognition is carried out by using this model. The operation steps are as follows:

（51）、以词本身、词性、词的出现频率在语料中进行标注； (51), mark in the corpus with the word itself, part of speech, and frequency of occurrence of the word;

（52）、利用CRF++ 0.53工具包对已标注的特征序列训练，获取条件随机场参数，该条件随机场参数为该领域术语识别的条件随机场模型； (52), use the CRF++ 0.53 toolkit to train the marked feature sequence to obtain the conditional random field parameter, which is the conditional random field model for term recognition in this field;

（53）、用领域术语识别的的条件随机场模型对测试已标注的特征序列的领域术语识别。 (53). Using the conditional random field model of field term recognition to test the field term recognition of the marked feature sequence.

本发明的基于互信息和条件随机场模型的中文领域术语识别方法与现有技术相比较，具有以下效果： Compared with the prior art, the Chinese domain term recognition method based on mutual information and conditional random field model of the present invention has the following effects:

（1）、该方法将基于统计和机器学习的两类术语识别方法有机地结合在一起，有效的解决了单纯利用统计方法进行术语识别时的数据稀疏问题； (1) This method organically combines two types of term recognition methods based on statistics and machine learning, and effectively solves the problem of data sparseness when simply using statistical methods for term recognition;

（2）、该方法利用互信息算法对语料进行分词和标注，实现了语料的自动标注； (2) This method uses the mutual information algorithm to segment and label the corpus, realizing the automatic labeling of the corpus;

（3）、该方法仅采用了3个最为普通的词特征，作为条件随机场方法的训练，使该方法具有较强的领域通用性，有效地降低了条件随机场的运算量，减少了条件随机场的训练时间。 (3) This method only uses the three most common word features as the training of the conditional random field method, which makes the method have strong field versatility, effectively reduces the calculation amount of the conditional random field, and reduces the condition Random field training time.

附图说明 Description of drawings

图1为本发明的基于互信息和条件随机场模型的中文领域术语识别方法的流程图； Fig. 1 is the flow chart of the Chinese domain term recognition method based on mutual information and conditional random field model of the present invention;

图2是图1中步骤(4)的流程图； Fig. 2 is the flowchart of step (4) among Fig. 1;

图3是图1中步骤(5)的流程图。 Fig. 3 is a flowchart of step (5) in Fig. 1 .

具体实施方式 Detailed ways

下面结合附图和具体实施方式对本发明作进一步详细的描述。 The present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

本实施例以植物——竹子的领域术语识别作为实例对本发明进行说明，但不用来限制本发明的范围。 In this embodiment, the field term recognition of plant-bamboo is taken as an example to illustrate the present invention, but it is not used to limit the scope of the present invention.

参照图1，本发明的基于互信息和条件随机场模型的中文领域术语识别方法，包括如下步骤： With reference to Fig. 1, the Chinese field term recognition method based on mutual information and conditional random field model of the present invention, comprises the steps:

（1）、收集领域文本语料，对语料中所有的标点符号、空格、数字、ASCII字符以及汉字以外字符进行标记。 (1) Collect domain text corpus, and mark all punctuation marks, spaces, numbers, ASCII characters and characters other than Chinese characters in the corpus.

例如，本实例选取《中国植物志》第9卷竹亚科的电子书稿作为领域文本语料。 For example, this example selects the e-book manuscript of the ninth volume of "Flora of China" as the domain text corpus.

首先，将语料按4:1的比例随机地划分为：训练语料和测试语料两部分； First, the corpus is randomly divided into two parts according to the ratio of 4:1: training corpus and test corpus;

然后，检索出语料中所有标点符号、空格、数字、ASCII字符以及汉字以外字符，在上述字符前、后分别用“//”符号进行标记； Then, retrieve all punctuation marks, spaces, numbers, ASCII characters and characters other than Chinese characters in the corpus, and mark them with "//" symbols before and after the above characters;

最后，参照汉语词性表，对所有代词、叹词、助词和虚词，以及首字为“和、有、的、得、将、把、从、了、是、则、在、每、这、该、给、所、使、为、不、着、了、很、该、与、得、的”词的前、后分别用“//”符号进行标记。 Finally, referring to the Chinese Parts of Speech Table, for all pronouns, interjections, auxiliary words and function words, as well as the initial characters of "he, you, de, get, will, put, from, got, is, then, in, every, this, the , Give, So, Make, For, Not, Write, Got, Very, Should, With, Get, The front and back of the words are marked with "//" symbols respectively.

（2）、设置字串

，计算字串的互信息值，其计算公式如下： (2), set the string

, to calculate the string The mutual information value of , its calculation formula is as follows:

假设一个领域术语是由n个字组成，如果字串

为一个领域术语，那么字串

由

、

、

……

个字组成，字串

is a domain term, then the string

Depend on

,

...

composed of words, string

The formula for calculating the mutual information value is as follows:

（1）

(1)

其中，

表示一个由n个字组成的字串； in,

Represents a string consisting of n characters;

表示组成字串

的第i个字，其中，i=1，2，3，…，n；

Indicates the composition of the string

The i-th word of , where i=1, 2, 3,..., n;

表示语料库中字出现的频次；

Represents words in the corpus frequency of occurrence;

表示语料库中字出现的频次；

Represents words in the corpus frequency of occurrence;

表示语料库中字

出现的频次；

Represents words in the corpus

frequency of occurrence;

表示语料库中字出现的频次；

Represents words in the corpus frequency of occurrence;

表示字

、、

、…、同时出现的频次； Indicates the word

, ,

,..., frequency of simultaneous occurrences;

表示字串

中所有字与字之间的互信息。 Represents a string

Mutual information between all words in .

由于本发明认为中文领域术语的长度不大于4个字，并且认为中文领域术语中间不可能出现标点符号、空格、数字、ASCII字符以及汉字以外字符，同时也不可能出再叹词、虚词、指标代词等词，所以本发明对语料文本中所有字分别计算其2-word、3-word、4-word的互信息值，当遇到标记符“//”停止计算，其互信息值的计算公式参见上述发明内容中步骤（2）的公式（1）、（2）、（3）。 Because the present invention thinks that the length of the term in the Chinese field is not more than 4 characters, and thinks that punctuation marks, spaces, numbers, ASCII characters, and characters other than Chinese characters cannot appear in the middle of the Chinese field term, and it is also impossible to appear interjections, function words, and indicators Pronouns and other words, so the present invention calculates the mutual information values of 2-word, 3-word, and 4-word for all words in the corpus text, and stops calculation when the marker "//" is encountered, and the calculation of the mutual information value For formulas, refer to formulas (1), (2) and (3) in step (2) in the above summary of the invention.

例如：语料“边缘被流苏状毛//，//”,其中2-word包括：“边缘”、“缘被”、“被流”、“流苏”、“苏状”和“状毛”；3-word包括：“边缘被”、“缘被流”、“被流苏”、“流苏状”和“苏状毛”；4-word包括：“边缘被流”、“缘被流苏”、“被流苏状”和“流苏状毛”，部份互信息计算结果为：

，

，

，

，； For example: the corpus "edge is fringed hair //, //", where 2-words include: "edge", "marginal cover", "be flowed", "tassel", "su-like" and "like hair"; 3-word includes: "edge is flowed", "edge is flowed", "be tasseled", "fringe-like" and "su-like hair"; 4-word includes: "edge is flowed", "edge is tasseled", "Fringed" and "fringed hair", the partial mutual information calculation results are:

,

, ;

（3）、计算字串

左右信息熵，其计算公式如下： (3), calculate the string

Left and right information entropy, its calculation formula is as follows:

左信息熵计算公式为： The left information entropy calculation formula is:

（2）

(2)

右信息熵计算公式为： The right information entropy calculation formula is:

（3）

(3)

其中，

表示为给定的一个由n个字组成的字串； in,

Represented as a given string consisting of n characters;

和

分别表示

出现在

左侧和右则时的条件概率；

and

Respectively

Appear in

Conditional probabilities for left and right time;

和

表示

左边和右边所有出现的词集合； and

express

The set of all occurrences of words on the left and right;

表示组成字串

的第i个字，其中，i=1，2，3，…，n。 Indicates the composition of the string

The i-th word of , where i=1, 2, 3,..., n.

判断一个字串是否为词，不仅要考虑字串内部字与字之间的结合紧密度，即字之间互信息的大小；同时，还要考虑字串之间的边界自由程度，即在字串边界出现的邻接字的种类越多，认为字串左右信息熵越大，也就是字串边界的自由度越大，其左右信息熵的计算公式参见上述发明内容中步骤（3）的公式（2）、（3）。 To judge whether a string is a word or not, not only the degree of combination between characters within the string, that is, the size of the mutual information between words, but also the degree of freedom of boundaries between strings, that is, the The more types of adjacent words appearing on the string boundary, the greater the left and right information entropy of the string is considered, that is, the greater the degree of freedom of the word string boundary, and the calculation formula of the left and right information entropy refers to the formula of step (3) in the above-mentioned content of the invention ( 2), (3).

例如：语料“边缘被流苏状毛//，//”中，部份左信息熵计算结果为：

，

,

，

，

，

；右信息熵计算结果为：

，，

，，

，

； For example: in the corpus "the edge is fringed hair //, //", the calculation result of part of the left information entropy is:

,

; The calculation result of right information entropy is:

, ,

,

;

（4）、定义字串

评价函数，设置评价函数

阈值，计算各字串的评价函数值，确定字串为词，依次比较该字串

中前一字

的评价函数值与后一字

评价函数值相比较，得到各字串

中对应的比值，其比值再与评价函数

阈值比较，逐一对字义字串

分词，其操作步骤如下： (4), define the string

evaluation function, set the evaluation function

Threshold, calculate the evaluation function value of each string, determine the string is a word, compare the strings in turn

Chinese character

The value of the evaluation function and the next word

Compare the evaluation function values to get each string

The corresponding ratio in , and its ratio is then compared with the evaluation function

Threshold comparison, one-by-one for literal strings

Word segmentation, the operation steps are as follows:

（41）、定义字串

评价函数，其计算表达式为： (41), define the string

Evaluation function, its calculation expression is:

（4）

(4)

其中，

表示为给定的一个由n个字组成的字串； in,

Represented as a given string consisting of n characters;

表示字串

中字符之间的互信息值；

Represents a string

Mutual information value between characters in ;

表示字串

的左信息熵值；

Represents a string

The left information entropy value of ;

表示字串的右信息熵值；

Represents a string The right information entropy value of ;

为平衡因子，用以调节信息熵与互信息值在评价函数中的权值。 It is a balance factor, which is used to adjust the weight of information entropy and mutual information value in the evaluation function.

（42）、分别计算评价函数数值，确定字串

为词。 (42), respectively calculate the value of the evaluation function, determine the string

for words.

根据上述发明内容中的步骤（4）的评价函数公式分别计算所有字串的评价函数值，其中

取0.5，并认为当评价函数大于阈值0.8时，该字串

为词， Calculate the evaluation function values of all strings according to the evaluation function formula of step (4) in the above-mentioned summary of the invention, wherein

Take 0.5, and think that when the evaluation function When greater than the threshold 0.8, the string

for words,

例如：语料“边缘被流苏状毛//，//”，部份评价函数计算结果为：

，，

，

，

，

； For example: in the corpus "the edge is fringed hair //, //", the calculation result of some evaluation functions is:

, ,

,

;

（43）、依次比较上述字串

中前一字

的评价函数值与后一字

评价函数值相比，得到各字串

中对应的比值“？”，其比值再与评价函数

阈值比较，逐一对字义字串

分词。 (43), compare the above strings in turn

Chinese character

The value of the evaluation function and the next word

Evaluation function values are compared to get each string

The corresponding ratio "?" in, and its ratio is then compared with the evaluation function

Threshold comparison, one-by-one for literal strings

Participle.

例如，首先从语料的第一个字开始，分别选取长度为4、3、2、1的子字串，记作

、

、

和

； For example, firstly, starting from the first word of the corpus, select substrings with lengths of 4, 3, 2, and 1 respectively, and write them as

,

and

;

然后，对字串

和

的评价函数进行比较，如果

，认为字串为新词，d在字串

前后分别以符号“*”进行标注；反之，认为字串

不是新词，则其丢弃尾部的最后一个字，对

和

的评价函数进行比较，如果

，认为字串

为新词，在字串

前后分别以符号“*”进行标注；反之，认为字串

不是新词，其丢弃尾部的最后一个字对

的评价函数进行判断，如果

，认为字串

为新词，在字串

前后分别以符号“*”进行标注；反之，认为字串

为新词，在字串

前后分别以符号“*”进行标注；只要有新词被标注，就从新词后的第一个字开始，再分别选取长度为4、3、2、1的子字串，记作

、、

和，重新进行评价函数的比较，当遇到“//”符号跳过。如此反复, 直至所以语料处理完为止，例如：语料“边缘被流苏状毛//，//”,首先，从第一个字开始截取长度分别为4、3、2、1的子字串，即：“边缘被流”、“边缘被”、“边缘”和“边”；然后，首先判断

是否大于等于0.8,根据步骤（41）评价函数的计算结果，可知

小于0.8，即字串“边缘被流”不是新词；然后，判断

是否大于等于0.8,根据步骤（41）评价函数的计算结果，可知

小于0.8，故字串“边缘被”也不是新词；接着判断

是否大于等于0.8,根据步骤（41）评价函数的计算结果，可知

大于0.8，故字串“边缘”是新词；当有判断出新词后，从新词后的第一个字开始再选取4、3、2、1个字串，作为新一轮的作

、

、

和，即“被流苏状”、“波流苏”、“被流”和“被”，再重复以上步骤进行比较，当遇到“//”符号跳过，直到结束，所以语料“边缘被流苏状毛//，//”，最后分词结果为“*边缘*被*流苏状*毛//,//” ； Then, for the string

and

The evaluation function is compared, if

, consider the string as a new word, d in the string

Mark with the symbol "*" before and after; otherwise, consider the string

is not a new word, it discards the last word at the end, for

and

The evaluation function is compared, if

, consider the string

for new words, in the string

Mark with the symbol "*" before and after; otherwise, consider the string

is not a new word, which discards the last word pair at the end

Evaluation function to judge, if

, consider the string

for new words, in the string

Mark with the symbol "*" before and after; otherwise, consider the string

for new words, in the string

The symbol "*" is used to mark the front and back respectively; as long as a new word is marked, start from the first word after the new word, and then select substrings with lengths of 4, 3, 2, and 1 respectively, and write them as

, ,

and , to re-comparison the evaluation function, and skip when the "//" symbol is encountered. Repeat this until all the corpus is processed, for example: the corpus "the edge is fringed hair //, //", first, start from the first word to intercept substrings with lengths of 4, 3, 2, 1 respectively, That is: "edge is flowed", "edge is", "edge" and "edge"; then, first judge

Whether it is greater than or equal to 0.8, according to the calculation result of the evaluation function in step (41), it can be known that

is less than 0.8, that is, the word string "edge is flowed" is not a new word; then, judge

is less than 0.8, so the word string "marginal quilt" is not a new word; then judge

is greater than 0.8, so the word string "edge" is a new word; when a new word is judged, select 4, 3, 2, and 1 word strings from the first word after the new word as a new round of writing

,

and , that is, "be fringed", "wave tassel", "be flowed" and "be", repeat the above steps for comparison, when encountering the "//" symbol skip until the end, so the corpus "edge is fringed" Mao//, //", the final word segmentation result is "*edge*be*fringed*hair//,//";

（5）、以词、词性、词的出现频率的随机场的训练特征，利用条件随机场训练出一个领域术语条件随机场模型，用该模型对进行领域术语识别，其操作步骤如下： (5) Using the training characteristics of the random field of word, part of speech, and word frequency, use the conditional random field to train a field term conditional random field model, and use this model to identify field terms. The operation steps are as follows:

（51）、以词本身、词性、词的出现频率在语料中进行标注，其具体如下： (51), mark in the corpus with the word itself, part of speech, and frequency of occurrence of the word, as follows:

依次对字义字串

分词标注特征序列，该词的标注的特征序列分别为：当前词本身；当前词的词性；当前词的出现频率，采用K-Means聚类方法，将上述当前词的出现频率分为10个等级，每个等级为一类，10个类分别表示为A、B、C、D、E、F、G、H、I、J、K，将已标注的特征序列分为：训练已标注的特征序列、测试已标注的特征序列两部份； literal string

The feature sequence of word segmentation, the feature sequence of the tag of the word is: the current word itself; the part of speech of the current word; the frequency of occurrence of the current word, using the K-Means clustering method to divide the frequency of the above current word into 10 levels , each level is a class, and the 10 classes are respectively represented as A, B, C, D, E, F, G, H, I, J, K, and the marked feature sequence is divided into: training marked features Sequence, test the two parts of the marked feature sequence;

（52）、利用CRF++ 0.53工具包对已标注的特征序列训练，获取条件随机场参数，条件随机场参数为领域术语识别的条件随机场模型； (52), use the CRF++ 0.53 toolkit to train the marked feature sequence, and obtain the conditional random field parameters, which are the conditional random field models for domain term recognition;

（53）、用领域术语识别的条件随机场模型对测试已标注的特征序列的领域术语识别，其具体如下： (53), use the conditional random field model of field term recognition to test the field term recognition of the marked feature sequence, which is as follows:

将测试已标注的特征序列输入到步骤（5.2）训练后获得领域术语识别的条件随机场模型，利用该条件随机场模型，计算出特征值，识别出领域术语，输出结果为识别出的领域术语，例如：语料“边缘被流苏状毛//，//”，最终识别出“边缘”和“流苏状”为领域术语。 Input the marked feature sequence of the test into the conditional random field model for field term recognition after training in step (5.2), use the conditional random field model to calculate the feature value, identify the field term, and output the recognized field term , For example: the corpus "edge is fringed hair //, //", and finally recognizes "edge" and "fringe" as domain terms.

以上为本发明的最佳实施方式，依据本发明公开的内容，本领域技术人员能够显而易见地想到一些雷同、替代方案，均应属于本发明的技术创新范围。 The above are the best implementation modes of the present invention. According to the disclosed content of the present invention, those skilled in the art can obviously think of some similarities and alternatives, which should all belong to the technical innovation scope of the present invention.

Claims

1. Chinese field term recognition methods based on mutual information and conditional random field models, concrete steps are as follows:

(1), the assembling sphere corpus of text, character beyond punctuation marks all in the language material, space, numeral, ascii character and the Chinese character is carried out mark;

(2), word string is set

, calculate word string

The mutual information value;

(3), calculate word string

Left and right sides information entropy;

(4), definition word string Evaluation function arranges evaluation function

Threshold value is calculated the evaluation function value of each word string, determines word string

Be word, successively this word string relatively

Middle prev word

Evaluation function value and a rear word

The evaluation function value is compared, and obtains each word string

The ratio of middle correspondence, its ratio again with evaluation function

Threshold ratio, one by one to meaning of word word string

Participle;

(5), with the training characteristics of the random field of the frequency of occurrences of word, part of speech, word, utilize the condition random field method to train a field term conditional random field models, with this model to carrying out field term identification.

2. the Chinese field term recognition methods based on mutual information and conditional random field models according to claim 1 is characterized in that, described in the above-mentioned steps (2) word string is set

, calculate word string

The mutual information value, its computing formula is as follows:

Suppose that a field term is comprised of n word, if word string

It is field term, so a word string

By

,

,

Individual word forms, word string

Mutual information value computing formula as follows:

（1）

Wherein,

Represent a word string that is formed by n word;

Expression forms word string

I word (i=1,2,3 ..., n);

Word in the expression corpus

The frequency that occurs;

Word in the expression corpus

The frequency that occurs;

Word in the expression corpus The frequency that occurs;

Word in the expression corpus

The frequency that occurs;

The expression word

, ,

...,

The frequency that occurs simultaneously;

The expression word string

In mutual information between all words and the word.

3. the Chinese field term recognition methods based on mutual information and conditional random field models according to claim 1 is characterized in that, the calculating left and right sides information entropy described in the above-mentioned steps (3), and its computing formula is as follows:

Left information entropy computing formula is:

(2)

Right information entropy computing formula is:

(3)

Wherein,

Be expressed as a given word string that is formed by n word;

With

Respectively expression Appear at

Left side and right conditional probability then the time;

With

Expression

The set of words that the left side and the right occur;

Expression forms word string

I word, wherein, i=1,2,3 ..., n.

4. the Chinese field term recognition methods based on mutual information and conditional random field models according to claim 1, it is characterized in that, definition word string W evaluation function described in the above-mentioned steps (4), and utilize evaluation function that language material is carried out participle, refer to the mutual information and the left and right sides information entropy that utilize step (2) and step (3) to calculate, to the word string in the language material

Estimate for the confidence level of word, judge whether this word string is word, and wherein, word string W evaluation function computing formula is as follows:

（4）

Wherein, Be expressed as a given word string that is formed by n word;

The expression word string

Mutual information value between the middle character;

The expression word string

Left information entropy;

The expression word string

Right information entropy;

Be balance factor, in order to regulate information entropy and mutual information value in word string

Weights in the evaluation function.

5. the Chinese field term recognition methods based on mutual information and conditional random field models according to claim 1, it is characterized in that, the training characteristics with the random field of the frequency of occurrences of word, part of speech, word described in the above-mentioned steps (5), utilize the condition random field method to train a field term conditional random field models, utilize this model to carrying out field term identification, its operation steps is as follows:

(51), the frequency of occurrences with word itself, part of speech, word marks in language material;

(52), utilize CRF++ 0.53 kit to the training of the characteristic sequence that marked, obtain the condition random field parameter, the conditional random field models that this condition random field parameter is identified for this field term;

(53), with field term identification the field term identification of conditional random field models characteristic sequence that test has been marked.