CN103049501A - Chinese domain term recognition method based on mutual information and conditional random field model - Google Patents
- Publication number: CN103049501A (also listed as CN2012105287348A, CN201210528734A)
- Authority: CN (China)
- Prior art keywords: word, string, word string, evaluation function, random field
- Legal status: Granted (the listed status is an assumption and is not a legal conclusion)
Description
Technical field
The invention relates to a Chinese domain term recognition method based on mutual information and a conditional random field model, and belongs to the field of information technology.
Background art
According to national standard GB/T 15237.1-2000, 《术语工作 词汇》 (Terminology Work: Vocabulary), a term is the verbal designation of a general concept in a specific subject field: a word or phrase used within a subject field to denote a concept or relation in that field. Terms can be divided into general terms used in daily life and domain terms used in specific fields. General terms mostly arise from people's living and working habits; they are not required to express a concept with strict accuracy, and their meanings are often vague. Domain terms are systematic, generalized descriptions of specialized concepts; ambiguity is not permitted, the concept each term expresses must be exact, and it must not vary from one user to another.
Domain term recognition means extracting specialized domain terms from a corpus of a specific scientific or technical field. As an important part of information extraction, automatic domain term recognition is widely applied in natural language processing and is important for improving the accuracy of domain text indexing and retrieval, text mining, ontology construction, text classification and clustering, and latent semantic analysis. Existing methods for recognizing domain terms in Chinese text mainly include:
(1) Statistics-based Chinese domain term recognition. The main idea is to extract domain terms by exploiting the high degree of association among the internal components of a term, together with the term's domain-specific feature information. The general procedure is as follows: first, methods from statistics or information theory are used to build various statistics, and reasonably accurate seed words are determined from them; the seeds are then expanded step by step to obtain the final domain terms. Word frequency, mean and variance are commonly used statistics; more researchers use hypothesis testing, mainly the t-test, the chi-square test, the log-likelihood ratio and pointwise mutual information. Statistical term recognition needs no syntactic or semantic information, is not tied to one specialized field, does not depend on any external resource, and is therefore highly general.
Among these, the statistics-based mutual information algorithm is the most widely used. For example, the article "基于互信息的中文术语抽取系统" (Chinese Term Extraction System Based on Mutual Information; Zhang Feng, Xu Yun, Hou Yan, Fan Xiaozhong; 《计算机应用研究》 (Application Research of Computers), 2005, vol. 22, no. 5, pp. 72-73, 77) discloses an automatic Chinese term extraction system. The system first computes the internal association strength of character strings from mutual information to obtain a set of term candidates; it then removes basic words from the candidate set and filters further using prefix and suffix information from common word collocations; finally it performs lexical analysis on the candidates and decides using the part-of-speech composition rules of terms, yielding the final extraction result. The reported experiments give a precision of 72.19%, a recall of 77.98% and an F-measure of 74.97% for term extraction with the mutual information algorithm. As a second example, the article "C值和互信息相结合的术语抽取" (Term Extraction Combining C-value and Mutual Information; Liang Yinghong, Zhang Wenjing, Zhang Youcheng; 《计算机应用与软件》 (Computer Applications and Software), 2010, vol. 27, no. 4, pp. 108-110) discloses a term extraction method that combines the C-value with mutual information, on the grounds that the C-value parameter is advantageous for extracting long terms. The reported experiments give a precision of 75.7%, a recall of 68.4% and an F-measure of 71.9% for long terms, higher than other methods on the same corpus. However, the performance of such algorithms depends directly on the size of the corpus and on the frequency of the candidate terms, and the data sparsity problem, in which some low-frequency candidates are nevertheless valid terms, is hard to solve. When the mutual information algorithm alone is used to recognize domain terms, precision, recall and F-measure therefore rarely exceed 80%, and a satisfactory recognition result is hard to obtain.
(2) Machine-learning-based Chinese domain term recognition. The main steps are: build a training corpus manually or semi-automatically; learn a model from the training corpus with some machine learning algorithm; then use the model to extract domain terms from a test corpus in order to verify the effectiveness of the algorithm. The machine learning techniques applied so far to Chinese domain term recognition mainly include decision trees, support vector machines, hidden Markov models, maximum entropy models, maximum entropy Markov models and conditional random fields. Machine-learning-based term recognition requires no expert domain knowledge or linguistic knowledge, is highly feasible to implement, and can achieve good recognition or extraction results when multiple term features are taken into account.
At present, the conditional random field (CRF) model is the most widely used of the machine-learning-based methods. For example, the article "一种中医名词术语自动抽取方法" (A Method for Automatic Extraction of Traditional Chinese Medicine Terminology; Zhang Wubei, Bai Yu, Wang Peiyan, Zhang Guiping; 《沈阳航空航天大学学报》 (Journal of Shenyang Aerospace University), 2011, vol. 28, no. 1, pp. 72-75) discloses a CRF-based term extraction method for the field of traditional Chinese medicine. The method treats term extraction as a sequence labeling problem, quantifies the distributional features of terms in the field as training features, trains a domain term model with a CRF toolkit, and then uses the model to extract terms. With 《名医类案》 (Classified Case Records of Famous Physicians) as the domain text, the reported precision is 83.11%, recall 81.04% and F-measure 82.06%. Likewise, the article "采用CRF技术的军事情报术语自动抽取研究" (Research on Automatic Extraction of Military Intelligence Terms Using CRF Technology; Jia Meiying, Yang Bingru, Zheng Dequan, Yang Jing; 《计算机工程与应用》 (Computer Engineering and Applications), 2009, vol. 45, no. 32, pp. 126-129) discloses a CRF-based term extraction method for the military intelligence field. It also treats domain term recognition as a sequence labeling problem, quantifies the distributional features of domain terms as training features, trains a domain term feature template with a CRF toolkit, and then uses the template to extract domain terms. The reported experiments show good recognition of military intelligence terms, with a precision of 73.24%, a recall of 69.57% and an F-measure of 71.36%.
When the conditional random field algorithm is used for domain term recognition, the training corpus is almost always annotated manually or semi-automatically. The degree of human involvement is high and the workload heavy, so the amount of annotated data is generally small, which limits the recognition accuracy and the applicability of the algorithm. In addition, the corpus must first be segmented with a general-purpose word segmentation tool, and CRF training and testing are then carried out on the segmented corpus before terms can finally be recognized. CRF-based term recognition therefore presupposes that existing general-purpose segmentation tools can segment the vocabulary of the domain accurately, and that domain terms are of coarser granularity than the words produced by the segmentation tool. Because specialized domain terms differ from ordinary vocabulary, however, general-purpose segmentation tools can hardly segment domain corpora accurately. As a result, current mutual information and conditional random field methods offer a low degree of automation in domain term recognition, and their recognition accuracy is not high.
Summary of the invention
In view of the problems of the prior art described above, the object of the invention is to provide a Chinese domain term recognition method based on mutual information and a conditional random field model which, in recognizing terms, overcomes the data sparsity of valid terms, reduces the computational load of the conditional random field algorithm, and improves the accuracy of Chinese domain term recognition.
To achieve this object, the invention adopts the following technical solution:
The Chinese domain term recognition method of the invention, based on mutual information and a conditional random field model, comprises the following steps:
(1) Collect a domain text corpus and mark all punctuation marks, spaces, digits, ASCII characters and non-Chinese characters in the corpus;
(2) Set the string W and compute the mutual information value of the string;
(3) Compute the left and right information entropy of the string;
(4) Define the string evaluation function and set the evaluation function threshold; compute the evaluation function value of each string to determine whether the string is a word; compare, in turn, the evaluation function value of the preceding string with that of the following string to obtain the corresponding ratio for each string; compare that ratio with the evaluation function threshold; and segment the defined strings one by one;
(5) Taking the word, its part of speech and its frequency of occurrence as the training features of the random field, train a domain term conditional random field model with the conditional random field method, and use the model to recognize domain terms.
In step (2) above, the string W is set and its mutual information value is computed as follows:
Suppose a domain term consists of n characters. If the string W is a domain term, then W is composed of its first, second, …, n-th characters, and the mutual information value of W is computed by formula (1):
[Formula (1): equation image not preserved in this text]
where W denotes a string of n characters;
the i-th character of W is indexed by i = 1, 2, 3, …, n;
the frequency of each individual character of W (the first, the second, the third, …, the n-th) is the number of times that character occurs in the corpus;
the joint frequency is the number of times all n characters of W occur together;
and the mutual information of W is the mutual information among all of the characters in the string.
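Because formula (1) itself is not reproduced in this text, the sketch below substitutes a common textbook definition, pointwise mutual information generalized to n characters, purely as an assumption; it should not be read as the patent's formula. The helper names `ngram_counts` and `pmi` are hypothetical. It only illustrates how the character frequencies and the joint frequency described above could be combined.

```python
import math
from collections import Counter

def ngram_counts(text, n):
    """Count every n-character window in text (overlapping windows included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def pmi(string, text):
    """Pointwise mutual information of `string` within `text`.

    ASSUMPTION: log2( p(w1..wn) / (p(w1) * ... * p(wn)) ), a stand-in for
    formula (1), which is not available in this text.
    """
    total = len(text)
    char_freq = ngram_counts(text, 1)                # frequency of each character
    joint = ngram_counts(text, len(string))[string]  # frequency of the whole string
    if joint == 0:
        return float("-inf")                         # the string never occurs
    p_joint = joint / total
    p_independent = math.prod(char_freq[c] / total for c in string)
    return math.log2(p_joint / p_independent)

toy_corpus = "流苏流苏边缘"
score = pmi("流苏", toy_corpus)
```

Probabilities are normalized by the number of characters in the text for simplicity; a more careful version would normalize the joint probability by the number of n-character windows.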
In step (3) above, the left and right information entropy are computed as follows:
The left information entropy is computed by formula (2):
[Formula (2): equation image not preserved in this text]
The right information entropy is computed by formula (3):
[Formula (3): equation image not preserved in this text]
where W denotes a given string of n characters;
the two conditional probabilities are, respectively, the probability that a given character appears immediately to the left of W and the probability that a given character appears immediately to the right of W, given W;
the two sets are the set of all characters that appear to the left of W and the set of all characters that appear to the right of W;
and the i-th character of W is indexed by i = 1, 2, 3, …, n.
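The definitions above match the usual notion of boundary (branching) entropy: the entropy of the distribution of characters seen immediately to the left, or to the right, of a string. The sketch below implements that standard definition as an assumption, since formulas (2) and (3) are not reproduced in this text; the names `boundary_entropies` and `_entropy` are hypothetical, and the "//" markers are ignored for brevity.

```python
import math
from collections import Counter

def _entropy(counter):
    """Shannon entropy (base 2) of the distribution given by raw counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counter.values())

def boundary_entropies(text, string):
    """Return (left_entropy, right_entropy) of `string` within `text`."""
    left, right = Counter(), Counter()
    start = text.find(string)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1        # character just before the string
        end = start + len(string)
        if end < len(text):
            right[text[end]] += 1             # character just after the string
        start = text.find(string, start + 1)  # also finds overlapping matches
    return _entropy(left), _entropy(right)

toy_text = "甲流苏乙丙流苏乙丁流苏戊"
left_e, right_e = boundary_entropies(toy_text, "流苏")
```

In the toy text, "流苏" has three different left neighbours but only two different right neighbours, so its left entropy is the larger of the two.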
In step (4) above, defining the evaluation function of the string W and using it to segment the corpus means using the mutual information and the left and right information entropy values computed in steps (2) and (3) to evaluate how credible it is that a string in the corpus is a word, and so to judge whether the string is a word. The evaluation function of the string W is computed by formula (4):
[Formula (4): equation image not preserved in this text]
where W denotes a given string of n characters;
the mutual information term is the mutual information value among the characters of W;
the left entropy term is the left information entropy of W;
the right entropy term is the right information entropy of W;
and the balance factor adjusts the relative weights of the information entropy and the mutual information value in the evaluation function of the string.
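Formula (4) is not reproduced in this text, so its exact form is unknown. The sketch below uses a simple weighted combination as a purely hypothetical stand-in, only to show how a balance factor can trade mutual information against boundary entropy. The function names and the use of the smaller of the two entropies are assumptions; the default balance factor 0.5 and threshold 0.8 are the values the embodiment states.

```python
def evaluation_score(mi, left_entropy, right_entropy, alpha=0.5):
    """Hypothetical stand-in for formula (4).

    ASSUMPTION: alpha * MI + (1 - alpha) * min(left, right entropy).
    The real formula is not available in this text.
    """
    return alpha * mi + (1 - alpha) * min(left_entropy, right_entropy)

def is_word(score, threshold=0.8):
    """A string counts as a word when its score is greater than the threshold."""
    return score > threshold

weak = evaluation_score(1.0, 0.5, 0.75)    # 0.5 * 1.0 + 0.5 * 0.5
strong = evaluation_score(1.0, 1.0, 2.0)   # 0.5 * 1.0 + 0.5 * 1.0
```

With inputs on comparable scales, raising `alpha` favours internally cohesive strings, while lowering it favours strings with freer boundaries.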
In step (5) above, taking the word, its part of speech and its frequency of occurrence as the training features of the random field, training a domain term conditional random field model with the conditional random field method, and using the model to recognize domain terms comprises the following sub-steps:
(51) Annotate the corpus with the word itself, its part of speech and its frequency of occurrence;
(52) Train on the annotated feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for recognizing terms of the domain;
(53) Use the conditional random field model for domain term recognition to recognize the domain terms in the annotated test feature sequences.
Compared with the prior art, the Chinese domain term recognition method of the invention, based on mutual information and a conditional random field model, has the following effects:
(1) The method combines the two families of term recognition methods, statistical and machine-learning-based, and effectively solves the data sparsity problem that arises when statistical methods alone are used for term recognition;
(2) The method uses the mutual information algorithm to segment and annotate the corpus, so the corpus is annotated automatically;
(3) The method uses only the three most common word features for training the conditional random field, which gives it strong generality across domains, effectively reduces the computational load of the conditional random field, and shortens its training time.
Description of the drawings
Fig. 1 is a flowchart of the Chinese domain term recognition method of the invention, based on mutual information and a conditional random field model;
Fig. 2 is a flowchart of step (4) in Fig. 1;
Fig. 3 is a flowchart of step (5) in Fig. 1.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and a specific embodiment.
This embodiment illustrates the invention with domain term recognition for a plant, bamboo, as the example; it is not intended to limit the scope of the invention.
Referring to Fig. 1, the Chinese domain term recognition method of the invention, based on mutual information and a conditional random field model, comprises the following steps:
(1) Collect a domain text corpus and mark all punctuation marks, spaces, digits, ASCII characters and non-Chinese characters in the corpus.
For example, this embodiment takes the electronic manuscript of the Bambusoideae (bamboo subfamily) in volume 9 of 《中国植物志》 (Flora Reipublicae Popularis Sinicae) as the domain text corpus.
First, the corpus is randomly divided in a 4:1 ratio into two parts, a training corpus and a test corpus;
Then, all punctuation marks, spaces, digits, ASCII characters and non-Chinese characters in the corpus are located, and the symbol "//" is placed before and after each of them as a marker;
Finally, with reference to a table of Chinese parts of speech, the symbol "//" is placed before and after all pronouns, interjections, auxiliary words and function words, and before and after words whose first character is one of "和、有、的、得、将、把、从、了、是、则、在、每、这、该、给、所、使、为、不、着、了、很、该、与、得、的".
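The first two preprocessing operations of step (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: "Chinese character" is taken to mean the CJK Unified Ideographs range U+4E00 to U+9FFF, each run of other characters is wrapped once in "//" markers, and the function names and the fixed random seed are hypothetical. Marking function words from a part-of-speech table is omitted because it needs an external tagger.

```python
import random
import re

# ASSUMPTION: a "Chinese character" is anything in U+4E00..U+9FFF.
NON_HANZI_RUN = re.compile(r"[^\u4e00-\u9fff]+")

def mark_non_hanzi(text):
    """Place '//' before and after every run of non-Chinese characters."""
    return NON_HANZI_RUN.sub(lambda m: "//" + m.group(0) + "//", text)

def split_corpus(sentences, train_ratio=0.8, seed=0):
    """Randomly split sentences into training and test parts (4:1 by default)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

marked = mark_non_hanzi("边缘被流苏状毛,")
train, test = split_corpus(["句子%d" % i for i in range(10)])
```

Applied to the phrase used in the examples of this embodiment, `mark_non_hanzi` yields the marked form "边缘被流苏状毛//,//".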
(2) Set the string W and compute its mutual information value, as follows:
Suppose a domain term consists of n characters. If the string W is a domain term, then W is composed of its first, second, …, n-th characters, and the mutual information value of W is computed by formula (1):
[Formula (1): equation image not preserved in this text]
where W denotes a string of n characters;
the i-th character of W is indexed by i = 1, 2, 3, …, n;
the frequency of each individual character of W (the first, the second, the third, …, the n-th) is the number of times that character occurs in the corpus;
the joint frequency is the number of times all n characters of W occur together;
and the mutual information of W is the mutual information among all of the characters in the string.
The invention assumes that a Chinese domain term is no longer than 4 characters, that punctuation marks, spaces, digits, ASCII characters and non-Chinese characters cannot occur inside a Chinese domain term, and that interjections, function words, demonstrative pronouns and similar words cannot occur inside one either. For every character in the corpus text, the invention therefore computes the mutual information values of the 2-word, 3-word and 4-word strings (strings of 2, 3 and 4 characters) that start at that character, and stops when the marker "//" is reached. The mutual information value is computed by formula (1) of step (2) in the Summary of the invention above.
For example, for the corpus fragment "边缘被流苏状毛//,//", the 2-word strings are "边缘", "缘被", "被流", "流苏", "苏状" and "状毛"; the 3-word strings are "边缘被", "缘被流", "被流苏", "流苏状" and "苏状毛"; and the 4-word strings are "边缘被流", "缘被流苏", "被流苏状" and "流苏状毛". [The partial mutual information results listed at this point are not preserved in this text.]
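Enumerating the 2-word, 3-word and 4-word strings described above can be sketched as follows. The function name `char_ngrams` is hypothetical, and the sketch assumes that "//" markers always come in pairs around the material to be skipped, as in the example, so that splitting on "//" leaves the usable runs of text at even positions.

```python
def char_ngrams(marked_text, sizes=(2, 3, 4)):
    """Return {n: [n-character strings]} without crossing any '//' marker."""
    result = {n: [] for n in sizes}
    # After splitting on '//', even positions hold ordinary text and odd
    # positions hold the marked material (punctuation, digits, and so on).
    for run in marked_text.split("//")[::2]:
        for n in sizes:
            for i in range(len(run) - n + 1):
                result[n].append(run[i:i + n])
    return result

grams = char_ngrams("边缘被流苏状毛//,//")
```

For the example fragment this gives six 2-word strings, five 3-word strings and four 4-word strings, in the same order as listed in the text.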
(3) Compute the left and right information entropy of the string W, as follows:
The left information entropy is computed by formula (2):
[Formula (2): equation image not preserved in this text]
The right information entropy is computed by formula (3):
[Formula (3): equation image not preserved in this text]
where W denotes a given string of n characters;
the two conditional probabilities are, respectively, the probability that a given character appears immediately to the left of W and the probability that a given character appears immediately to the right of W, given W;
the two sets are the set of all characters that appear to the left of W and the set of all characters that appear to the right of W;
and the i-th character of W is indexed by i = 1, 2, 3, …, n.
To judge whether a string is a word, one must consider not only how tightly the characters inside the string are bound to each other, that is, the size of the mutual information among them, but also how free the boundaries of the string are. The more kinds of adjacent characters appear at the boundary of a string, the larger its left and right information entropy, and hence the greater the freedom of its boundary. The left and right information entropy are computed by formulas (2) and (3) of step (3) in the Summary of the invention above.
For example, for the corpus fragment "边缘被流苏状毛//,//": [the partial left and right information entropy results listed at this point are not preserved in this text.]
(4) Define the string evaluation function and set the evaluation function threshold; compute the evaluation function value of each string to determine whether the string is a word; compare, in turn, the evaluation function value of the preceding string with that of the following string to obtain the corresponding ratio for each string; compare that ratio with the evaluation function threshold; and segment the defined strings one by one. The sub-steps are as follows:
(41) Define the evaluation function of the string W, computed by formula (4):
[Formula (4): equation image not preserved in this text]
where W denotes a given string of n characters;
the mutual information term is the mutual information value among the characters of W;
the left entropy term is the left information entropy of W;
the right entropy term is the right information entropy of W;
and the balance factor adjusts the relative weights of the information entropy and the mutual information value in the evaluation function.
(42) Compute the evaluation function values and determine whether each string is a word.
The evaluation function values of all strings are computed with the evaluation function formula of step (4) in the Summary of the invention above, with the balance factor set to 0.5; a string is considered a word when its evaluation function value is greater than the threshold 0.8.
For example, for the corpus fragment "边缘被流苏状毛//,//": [the partial evaluation function results listed at this point are not preserved in this text.]
(43) Compare, in turn, the evaluation function value of the preceding string with that of the following string to obtain the corresponding ratio for each string, compare that ratio with the evaluation function threshold, and segment the defined strings one by one.
For example, first, starting from the first character of the corpus, substrings of length 4, 3, 2 and 1 are taken.
Then the evaluation function of the length-4 substring is tested against that of the length-3 substring. If the test is passed, the length-4 substring is taken to be a new word and is marked with the symbol "*" before and after it. Otherwise it is not a new word; its last character is discarded, and the length-3 substring is tested against the length-2 substring in the same way. If that test is passed, the length-3 substring is taken to be a new word and is marked with "*" before and after it. Otherwise it is not a new word; its last character is discarded, and the evaluation function of the length-2 substring is tested. If the test is passed, the length-2 substring is taken to be a new word and is marked with "*" before and after it; otherwise the single character is taken to be a new word and is marked with "*" before and after it. Whenever a new word has been marked, substrings of length 4, 3, 2 and 1 are again taken starting from the first character after the new word, and the comparison of evaluation functions is repeated. The symbol "//" is skipped whenever it is encountered. This is repeated until the whole corpus has been processed.
Take the corpus fragment "边缘被流苏状毛//,//" as an example. First, substrings of length 4, 3, 2 and 1 are cut starting from the first character: "边缘被流", "边缘被", "边缘" and "边". The evaluation function value of "边缘被流" is tested first for being greater than or equal to 0.8; according to the evaluation function results of step (41) it is less than 0.8, so the string "边缘被流" is not a new word. The value for "边缘被" is tested next; it is also less than 0.8, so the string "边缘被" is not a new word either. The value for "边缘" is then tested; it is greater than 0.8, so the string "边缘" is a new word. Once a new word has been found, strings of length 4, 3, 2 and 1 are taken starting from the first character after it as the substrings for the next round, namely "被流苏状", "被流苏", "被流" and "被", and the steps above are repeated, skipping the symbol "//" when it is encountered, until the end of the corpus. The final segmentation of the fragment "边缘被流苏状毛//,//" is therefore "*边缘*被*流苏状*毛//,//";
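The worked example above amounts to a greedy, longest-first scan: try the 4-character string at the current position, then the 3-character one, then the 2-character one, and fall back to the single character. A minimal sketch follows, assuming the test at each length is "evaluation function value greater than or equal to 0.8" as in the worked example. The function name `segment` is hypothetical, and the scores passed in are invented for illustration because the patent's computed values are not preserved in this text.

```python
def segment(run, score, max_len=4, threshold=0.8):
    """Greedy longest-first segmentation of one run of Chinese characters.

    `score` maps a candidate string to its evaluation function value.
    """
    words, i = [], 0
    while i < len(run):
        for n in range(min(max_len, len(run) - i), 0, -1):
            candidate = run[i:i + n]
            # A single character is always accepted as the last resort.
            if n == 1 or score(candidate) >= threshold:
                words.append(candidate)
                i += n          # continue right after the accepted word
                break
    return words

# Hypothetical scores: only these two strings pass the threshold.
toy_scores = {"边缘": 0.9, "流苏状": 0.85}
words = segment("边缘被流苏状毛", lambda s: toy_scores.get(s, 0.0))
marked = "*" + "*".join(words)
```

Runs separated by "//" markers would each be passed to `segment` on their own, which is how the markers end up being skipped.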
(5)、以词、词性、词的出现频率的随机场的训练特征,利用条件随机场训练出一个领域术语条件随机场模型,用该模型对进行领域术语识别,其操作步骤如下: (5) Using the training characteristics of the random field of word, part of speech, and word frequency, use the conditional random field to train a field term conditional random field model, and use this model to identify field terms. The operation steps are as follows:
(51) The corpus is annotated with the word itself, its part of speech and its frequency of occurrence, as follows:
Each segmented word of the word string is annotated in turn with a feature sequence consisting of: the current word itself; the part of speech of the current word; and the frequency of occurrence of the current word. The K-Means clustering method is used to divide the frequencies of occurrence into 10 levels, each level forming one class, and the 10 classes are denoted respectively as A, B, C, D, E, F, G, H, I, J, K. The annotated feature sequences are divided into two parts: annotated feature sequences for training and annotated feature sequences for testing.
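As an illustration of the frequency feature, the following Python sketch clusters word frequencies with a small one-dimensional K-Means (Lloyd's algorithm) and maps each cluster to a letter. Several details are assumptions made for the example only: the deterministic initialisation, the convention that "A" is the lowest-frequency level, and the hypothetical counts in `toy_freqs`. With only four distinct counts the sketch produces four levels rather than ten.

```python
import string
from typing import Dict, List


def kmeans_1d(values: List[float], k: int, iters: int = 100) -> List[float]:
    """Lloyd's algorithm on scalar values; returns the centroids in ascending order."""
    pts = sorted(set(values))
    k = min(k, len(pts))  # cannot have more clusters than distinct values
    # Deterministic start: spread the initial centroids over the sorted distinct values.
    centroids = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        buckets: List[List[float]] = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        updated = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return sorted(centroids)


def frequency_levels(freqs: Dict[str, int], k: int = 10) -> Dict[str, str]:
    """Map each word to a letter level; 'A' is the lowest-frequency cluster (assumption)."""
    centroids = kmeans_1d([float(f) for f in freqs.values()], k)

    def level(f: float) -> str:
        j = min(range(len(centroids)), key=lambda j: abs(f - centroids[j]))
        return string.ascii_uppercase[j]

    return {word: level(float(f)) for word, f in freqs.items()}


# Hypothetical frequency counts for the words of the running example.
toy_freqs = {"边缘": 120, "被": 300, "流苏状": 15, "毛": 80}
levels = frequency_levels(toy_freqs)
print(levels)  # {'边缘': 'C', '被': 'D', '流苏状': 'A', '毛': 'B'}
```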
(52) The CRF++ 0.53 toolkit is used to train on the annotated training feature sequences and obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition.
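A minimal sketch of how the annotated sequences could be written in the column format that CRF++ reads (one token per line, whitespace-separated columns, the label in the last column, a blank line between sentences) is given below. The feature template, the part-of-speech tags, the frequency levels and the label set `TERM`/`O` are all assumptions chosen for illustration; the text does not specify them.

```python
from typing import List, Tuple

# (word, part of speech, frequency level, label)
Token = Tuple[str, str, str, str]

# Hypothetical CRF++ feature template over columns 0 (word), 1 (POS), 2 (level).
TEMPLATE = """\
# Unigram features
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[0,2]
U05:%x[0,0]/%x[0,1]

# Bigram feature over adjacent output labels
B
"""


def to_crfpp(sentences: List[List[Token]]) -> str:
    """Serialise annotated sentences as tab-separated rows, blank line between sentences."""
    blocks = ["\n".join("\t".join(tok) for tok in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


# Hypothetical annotation of the running example.
example = [[
    ("边缘", "n", "C", "TERM"),
    ("被", "p", "D", "O"),
    ("流苏状", "n", "A", "TERM"),
    ("毛", "n", "B", "O"),
]]
train_data = to_crfpp(example)
print(train_data)
```

With the template and the data saved to files, training and testing would typically be run with the CRF++ command-line tools, for example `crf_learn template train.data model` followed by `crf_test -m model test.data`; the file names here are placeholders.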
(53) The conditional random field model for domain term recognition is used to recognize domain terms in the annotated test feature sequences, as follows:
The annotated test feature sequences are input to the conditional random field model for domain term recognition obtained by training in step (52). The model computes the feature values and identifies the domain terms, and the output is the recognized domain terms. For example, for the corpus "边缘被流苏状毛//,//", "边缘" and "流苏状" are finally recognized as domain terms.
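The recognition output can be post-processed along the following lines. The sketch assumes that the tagger writes one token per line with the predicted label appended as the last column (as `crf_test` does) and that the label `TERM` marks a domain term; the label name and the sample output below are hand-written for illustration and are not results from a trained model.

```python
from typing import List


def extract_terms(tagged: str, term_label: str = "TERM") -> List[str]:
    """Collect the words whose predicted label (last column) equals term_label."""
    terms: List[str] = []
    for line in tagged.splitlines():
        cols = line.split()
        if cols and cols[-1] == term_label:
            terms.append(cols[0])  # column 0 holds the word itself
    return terms


# Hand-written sample: word, POS, frequency level, gold label, predicted label.
sample_output = (
    "边缘\tn\tC\tTERM\tTERM\n"
    "被\tp\tD\tO\tO\n"
    "流苏状\tn\tA\tTERM\tTERM\n"
    "毛\tn\tB\tO\tO\n"
)
terms = extract_terms(sample_output)
print(terms)  # ['边缘', '流苏状']
```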
The above is the preferred embodiment of the present invention. Similar or alternative schemes that those skilled in the art can readily conceive on the basis of the disclosed content shall all fall within the scope of the technical innovation of the present invention.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210528734.8A CN103049501B (en) | 2012-12-11 | 2012-12-11 | Chinese domain term recognition method based on mutual information and conditional random field model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210528734.8A CN103049501B (en) | 2012-12-11 | 2012-12-11 | Chinese domain term recognition method based on mutual information and conditional random field model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103049501A (en) | 2013-04-17 |
CN103049501B (en) | 2016-08-03 |
Family
ID=48062142
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210528734.8A Expired - Fee Related CN103049501B (en) | Chinese domain term recognition method based on mutual information and conditional random field model | 2012-12-11 | 2012-12-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103049501B (en) |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593427A (en) * | 2013-11-07 | 2014-02-19 | 清华大学 | New word searching method and system |
CN103778243A (en) * | 2014-02-11 | 2014-05-07 | 北京信息科技大学 | Domain term extraction method |
CN103902673A (en) * | 2014-03-19 | 2014-07-02 | 新浪网技术(中国)有限公司 | Anti-garbage-filtering rule upgrading method and device |
CN104572621A (en) * | 2015-01-05 | 2015-04-29 | 语联网(武汉)信息技术有限公司 | Decision tree based term judgment method |
CN104679885A (en) * | 2015-03-17 | 2015-06-03 | 北京理工大学 | User search string organization name recognition method based on semantic feature model |
CN105183923A (en) * | 2015-10-27 | 2015-12-23 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN105224682A (en) * | 2015-10-27 | 2016-01-06 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN105260362A (en) * | 2015-10-30 | 2016-01-20 | 小米科技有限责任公司 | New word extraction method and device |
CN105389349A (en) * | 2015-10-27 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and apparatus |
CN106021230A (en) * | 2016-05-19 | 2016-10-12 | 无线生活(杭州)信息科技有限公司 | Word segmentation method and word segmentation apparatus |
CN106095753A (en) * | 2016-06-07 | 2016-11-09 | 大连理工大学 | A kind of financial field based on comentropy and term credibility term recognition methods |
WO2016179988A1 (en) * | 2015-05-12 | 2016-11-17 | 深圳市华傲数据技术有限公司 | Chinese address parsing and annotation method |
CN106202056A (en) * | 2016-07-26 | 2016-12-07 | 北京智能管家科技有限公司 | Chinese word segmentation scene library update method and system |
CN106445921A (en) * | 2016-09-29 | 2017-02-22 | 北京理工大学 | Chinese text term extracting method utilizing quadratic mutual information |
CN106649661A (en) * | 2016-12-13 | 2017-05-10 | 税云网络科技服务有限公司 | Method and device for establishing knowledge base |
CN106991085A (en) * | 2017-04-01 | 2017-07-28 | 中国工商银行股份有限公司 | The abbreviation generation method and device of a kind of entity |
CN107291692A (en) * | 2017-06-14 | 2017-10-24 | 北京百度网讯科技有限公司 | Method for customizing, device, equipment and the medium of participle model based on artificial intelligence |
CN107391486A (en) * | 2017-07-20 | 2017-11-24 | 南京云问网络技术有限公司 | A kind of field new word identification method based on statistical information and sequence labelling |
CN107423278A (en) * | 2016-05-23 | 2017-12-01 | 株式会社理光 | The recognition methods of essential elements of evaluation, apparatus and system |
CN108268440A (en) * | 2017-01-04 | 2018-07-10 | 普天信息技术有限公司 | A kind of unknown word identification method |
CN108509425A (en) * | 2018-04-10 | 2018-09-07 | 中国人民解放军陆军工程大学 | Chinese new word discovery method based on novelty |
CN108776653A (en) * | 2018-05-25 | 2018-11-09 | 南京大学 | A kind of text segmenting method of the judgement document based on PageRank and comentropy |
CN108804413A (en) * | 2018-04-28 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | The recognition methods of text cheating and device |
CN109145282A (en) * | 2017-06-16 | 2019-01-04 | 贵州小爱机器人科技有限公司 | Punctuate model training method, punctuate method, apparatus and computer equipment |
CN109492224A (en) * | 2018-11-07 | 2019-03-19 | 北京金山数字娱乐科技有限公司 | A kind of method and device of vocabulary building |
CN109710947A (en) * | 2019-01-22 | 2019-05-03 | 福建亿榕信息技术有限公司 | Method and device for generating electric power professional thesaurus |
CN110175331A (en) * | 2019-05-29 | 2019-08-27 | 三角兽(北京)科技有限公司 | Recognition methods, device, electronic equipment and the readable storage medium storing program for executing of technical term |
CN111090742A (en) * | 2019-12-19 | 2020-05-01 | 东软集团股份有限公司 | Question and answer pair evaluation method and device, storage medium and equipment |
CN115495507A (en) * | 2022-11-17 | 2022-12-20 | 江苏鸿程大数据技术与应用研究院有限公司 | Engineering material information price matching method, system and storage medium |
CN116702786A (en) * | 2023-08-04 | 2023-09-05 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202043B (en) * | 2016-05-20 | 2019-04-12 | 北京理工大学 | A kind of new word identification immune genetic method based at word rate fitness function |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101295294A (en) * | 2008-06-12 | 2008-10-29 | 昆明理工大学 | Improved Bayesian Word Sense Disambiguation Method Based on Information Gain |
US20100088353A1 (en) * | 2006-10-17 | 2010-04-08 | Samsung Sds Co., Ltd. | Migration Apparatus Which Convert Database of Mainframe System into Database of Open System and Method for Thereof |
CN102314507A (en) * | 2011-09-08 | 2012-01-11 | 北京航空航天大学 | Recognition ambiguity resolution method of Chinese named entity |
2012-12-11: CN201210528734.8A (CN103049501B), status: Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100088353A1 (en) * | 2006-10-17 | 2010-04-08 | Samsung Sds Co., Ltd. | Migration Apparatus Which Convert Database of Mainframe System into Database of Open System and Method for Thereof |
CN101295294A (en) * | 2008-06-12 | 2008-10-29 | 昆明理工大学 | Improved Bayesian Word Sense Disambiguation Method Based on Information Gain |
CN102314507A (en) * | 2011-09-08 | 2012-01-11 | 北京航空航天大学 | Recognition ambiguity resolution method of Chinese named entity |
Non-Patent Citations (3)
Title |
---|
周浪 等: "一种面向术语抽取的短语过滤技术", 《计算机工程与应用》, no. 19, 31 December 2009 (2009-12-31), pages 9 - 11 * |
贾美英 等: "采用CRF技术的军事情报术语自动抽取研究", 《计算机工程与应用》, no. 32, 31 December 2009 (2009-12-31), pages 126 - 129 * |
赵秦怡 等: "一种基于互信息的串扫描中文文本分词方法", 《情报杂志》, vol. 29, no. 7, 31 July 2010 (2010-07-31), pages 152 - 172 * |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593427A (en) * | 2013-11-07 | 2014-02-19 | 清华大学 | New word searching method and system |
CN103778243A (en) * | 2014-02-11 | 2014-05-07 | 北京信息科技大学 | Domain term extraction method |
CN103778243B (en) * | 2014-02-11 | 2017-02-08 | 北京信息科技大学 | Domain term extraction method |
CN103902673A (en) * | 2014-03-19 | 2014-07-02 | 新浪网技术(中国)有限公司 | Anti-garbage-filtering rule upgrading method and device |
CN103902673B (en) * | 2014-03-19 | 2017-11-24 | 新浪网技术(中国)有限公司 | Anti-spam filtering rule upgrade method and device |
CN104572621A (en) * | 2015-01-05 | 2015-04-29 | 语联网(武汉)信息技术有限公司 | Decision tree based term judgment method |
CN104572621B (en) * | 2015-01-05 | 2018-01-26 | 语联网(武汉)信息技术有限公司 | A kind of term decision method based on decision tree |
CN104679885A (en) * | 2015-03-17 | 2015-06-03 | 北京理工大学 | User search string organization name recognition method based on semantic feature model |
WO2016179988A1 (en) * | 2015-05-12 | 2016-11-17 | 深圳市华傲数据技术有限公司 | Chinese address parsing and annotation method |
CN105389349A (en) * | 2015-10-27 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and apparatus |
CN105389349B (en) * | 2015-10-27 | 2018-07-27 | 上海智臻智能网络科技股份有限公司 | Dictionary update method and device |
CN108875040A (en) * | 2015-10-27 | 2018-11-23 | 上海智臻智能网络科技股份有限公司 | Dictionary update method and computer readable storage medium |
CN105224682B (en) * | 2015-10-27 | 2018-06-05 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN108897842A (en) * | 2015-10-27 | 2018-11-27 | 上海智臻智能网络科技股份有限公司 | Computer readable storage medium and computer system |
CN105224682A (en) * | 2015-10-27 | 2016-01-06 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN105183923A (en) * | 2015-10-27 | 2015-12-23 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN108897842B (en) * | 2015-10-27 | 2021-04-09 | 上海智臻智能网络科技股份有限公司 | Computer readable storage medium and computer system |
CN108875040B (en) * | 2015-10-27 | 2020-08-18 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and computer-readable storage medium |
CN105260362A (en) * | 2015-10-30 | 2016-01-20 | 小米科技有限责任公司 | New word extraction method and device |
CN106021230A (en) * | 2016-05-19 | 2016-10-12 | 无线生活(杭州)信息科技有限公司 | Word segmentation method and word segmentation apparatus |
CN106021230B (en) * | 2016-05-19 | 2018-11-23 | 无线生活(杭州)信息科技有限公司 | A kind of segmenting method and device |
CN107423278A (en) * | 2016-05-23 | 2017-12-01 | 株式会社理光 | The recognition methods of essential elements of evaluation, apparatus and system |
CN107423278B (en) * | 2016-05-23 | 2020-07-14 | 株式会社理光 | Evaluation element identification method, device and system |
CN106095753A (en) * | 2016-06-07 | 2016-11-09 | 大连理工大学 | A kind of financial field based on comentropy and term credibility term recognition methods |
CN106095753B (en) * | 2016-06-07 | 2018-11-06 | 大连理工大学 | A kind of financial field term recognition methods based on comentropy and term confidence level |
CN106202056A (en) * | 2016-07-26 | 2016-12-07 | 北京智能管家科技有限公司 | Chinese word segmentation scene library update method and system |
CN106202056B (en) * | 2016-07-26 | 2019-01-04 | 北京智能管家科技有限公司 | Chinese word segmentation scene library update method and system |
CN106445921A (en) * | 2016-09-29 | 2017-02-22 | 北京理工大学 | Chinese text term extracting method utilizing quadratic mutual information |
CN106445921B (en) * | 2016-09-29 | 2019-05-07 | 北京理工大学 | Chinese text term extraction method using quadratic mutual information |
CN106649661A (en) * | 2016-12-13 | 2017-05-10 | 税云网络科技服务有限公司 | Method and device for establishing knowledge base |
CN108268440A (en) * | 2017-01-04 | 2018-07-10 | 普天信息技术有限公司 | A kind of unknown word identification method |
CN106991085B (en) * | 2017-04-01 | 2020-08-04 | 中国工商银行股份有限公司 | Entity abbreviation generation method and device |
CN106991085A (en) * | 2017-04-01 | 2017-07-28 | 中国工商银行股份有限公司 | The abbreviation generation method and device of a kind of entity |
CN107291692B (en) * | 2017-06-14 | 2020-12-18 | 北京百度网讯科技有限公司 | Artificial intelligence-based word segmentation model customization method, device, equipment and medium |
CN107291692A (en) * | 2017-06-14 | 2017-10-24 | 北京百度网讯科技有限公司 | Method for customizing, device, equipment and the medium of participle model based on artificial intelligence |
CN109145282A (en) * | 2017-06-16 | 2019-01-04 | 贵州小爱机器人科技有限公司 | Punctuate model training method, punctuate method, apparatus and computer equipment |
CN109145282B (en) * | 2017-06-16 | 2023-11-07 | 贵州小爱机器人科技有限公司 | Sentence-breaking model training method, sentence-breaking device and computer equipment |
CN107391486A (en) * | 2017-07-20 | 2017-11-24 | 南京云问网络技术有限公司 | A kind of field new word identification method based on statistical information and sequence labelling |
CN108509425B (en) * | 2018-04-10 | 2021-08-24 | 中国人民解放军陆军工程大学 | A Novelty-based Chinese New Word Discovery Method |
CN108509425A (en) * | 2018-04-10 | 2018-09-07 | 中国人民解放军陆军工程大学 | Chinese new word discovery method based on novelty |
CN108804413A (en) * | 2018-04-28 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | The recognition methods of text cheating and device |
CN108776653A (en) * | 2018-05-25 | 2018-11-09 | 南京大学 | A kind of text segmenting method of the judgement document based on PageRank and comentropy |
CN109492224B (en) * | 2018-11-07 | 2024-05-03 | 北京金山数字娱乐科技有限公司 | Vocabulary construction method and device |
CN109492224A (en) * | 2018-11-07 | 2019-03-19 | 北京金山数字娱乐科技有限公司 | A kind of method and device of vocabulary building |
CN109710947A (en) * | 2019-01-22 | 2019-05-03 | 福建亿榕信息技术有限公司 | Method and device for generating electric power professional thesaurus |
CN109710947B (en) * | 2019-01-22 | 2021-09-07 | 福建亿榕信息技术有限公司 | Method and device for generating electric power professional thesaurus |
CN110175331A (en) * | 2019-05-29 | 2019-08-27 | 三角兽(北京)科技有限公司 | Recognition methods, device, electronic equipment and the readable storage medium storing program for executing of technical term |
CN111090742A (en) * | 2019-12-19 | 2020-05-01 | 东软集团股份有限公司 | Question and answer pair evaluation method and device, storage medium and equipment |
CN111090742B (en) * | 2019-12-19 | 2024-05-17 | 东软集团股份有限公司 | Question-answer pair evaluation method, question-answer pair evaluation device, storage medium and equipment |
CN115495507B (en) * | 2022-11-17 | 2023-03-24 | 江苏鸿程大数据技术与应用研究院有限公司 | Engineering material information price matching method, system and storage medium |
CN115495507A (en) * | 2022-11-17 | 2022-12-20 | 江苏鸿程大数据技术与应用研究院有限公司 | Engineering material information price matching method, system and storage medium |
CN116702786A (en) * | 2023-08-04 | 2023-09-05 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
CN116702786B (en) * | 2023-08-04 | 2023-11-17 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
Also Published As
Publication number | Publication date |
---|---|
CN103049501B (en) | 2016-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103049501B (en) | Chinese domain term recognition method based on mutual information and conditional random field model | |
Li et al. | Twiner: named entity recognition in targeted twitter stream | |
CN107526799B (en) | A Deep Learning-Based Knowledge Graph Construction Method | |
CN106997382B (en) | Automatic labeling method and system for innovative creative labels based on big data | |
CN107133213B (en) | A method and system for automatic extraction of text summaries based on algorithm | |
WO2021068339A1 (en) | Text classification method and device, and computer readable storage medium | |
CN104572892B (en) | A Text Classification Method Based on Recurrent Convolutional Network | |
CN102591988B (en) | Short text classification method based on semantic graphs | |
WO2017167067A1 (en) | Method and device for webpage text classification, method and device for webpage text recognition | |
CN106095753B (en) | A kind of financial field term recognition methods based on comentropy and term confidence level | |
CN108710611B (en) | A short text topic model generation method based on word network and word vector | |
CN110598203A (en) | A method and device for extracting entity information of military scenario documents combined with dictionaries | |
CN106845358B (en) | Method and system for feature recognition of handwritten character images | |
CN106372061A (en) | Short text similarity calculation method based on semantics | |
CN103970729A (en) | Multi-subject extracting method based on semantic categories | |
CN103970730A (en) | Method for extracting multiple subject terms from single Chinese text | |
CN102567308A (en) | Information processing feature extracting method | |
CN110705292B (en) | Entity name extraction method based on knowledge base and deep learning | |
CN110347701B (en) | A Target Type Identification Method for Entity Retrieval Query | |
CN108376133A (en) | The short text sensibility classification method expanded based on emotion word | |
CN106227756A (en) | A kind of stock index forecasting method based on emotional semantic classification and system | |
CN102737112B (en) | Concept Relevance Calculation Method Based on Representational Semantic Analysis | |
CN105868347A (en) | Tautonym disambiguation method based on multistep clustering | |
CN108038099A (en) | Low frequency keyword recognition method based on term clustering | |
CN105912525A (en) | Sentiment classification method for semi-supervised learning based on theme characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160803; Termination date: 20181211 |