CN102662923A - Entity instance learning method based on machine learning - Google Patents

Entity instance learning method based on machine learning

Info

Publication number
CN102662923A
CN102662923A (application CN2012101218391A / CN201210121839A)
Authority
CN
China
Prior art keywords
text
maximum entropy
word
classifier
features
Prior art date
Application number
CN2012101218391A
Other languages
Chinese (zh)
Inventor
张萌
王文俊
Original Assignee
天津大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天津大学
Priority to CN2012101218391A
Publication of CN102662923A

Abstract

The invention belongs to the technical field of natural language processing and ontology learning, and relates to a machine-learning-based method for learning ontology instances. The method comprises the following steps: after document preprocessing, annotating a corpus; selecting various features, including word features, part-of-speech features, and combination features of words and parts of speech, and converting the corpus and the texts to be recognized into feature-vector form; training a maximum entropy model, estimating the model parameters from the annotated corpus to obtain a maximum entropy classifier; and extracting instances with the maximum entropy classifier. The method has the advantage that ontology instances can be learned quickly and effectively from large amounts of text.

Description

一种基于机器学习的本体实例学习方法 (A machine-learning-based ontology instance learning method)

Technical Field

[0001] The present invention relates to the technical fields of natural language processing and ontology learning. Based on the characteristics of the ontology model, it draws on methods and experience from machine-learning-based text processing in natural language processing to learn ontology instances.

背景技术 Background technique

[0002] Currently, most information on the traditional web is unstructured, poorly organized, and full of useless and redundant content. The explosive growth of Internet information makes processing information and acquiring knowledge even more difficult. An ontology is an explicit, formal specification of a shared conceptual model; in the Semantic Web it enables good communication and mutual understanding among service layers. Ontologies provide the knowledge base and rule base on which the Semantic Web is built, and on this basis semantic search and intelligent applications become possible. Much current research focuses on how to construct the concepts and relations of an ontology. For an ontology whose initial construction is already complete, however, and especially when it is deployed in a heavily used system, how to extract ontology instances from large amounts of unstructured data is a question worth considering. On the one hand, a complete ontology should include instances of its various concepts and relations; on the other hand, continually absorbing new instances helps refine the ontology model and develop it in a better direction.

[0003] At present, two lines of research involve the generation of ontology instances. One line takes ontology instance generation as its main goal; such methods mostly use pattern matching at their core. In the other line, ontology instances and attributes are usually learned through ontology-based information extraction, with the instances arising as a byproduct of the research: in ontology-based extraction systems, researchers exploit the characteristics of the ontology during information extraction to improve extraction efficiency and accuracy, and this process ultimately produces a large number of ontology instances. Many ontology-based extraction systems adopt the GATE framework, introducing an already-constructed ontology to recognize named entities.

[0004] With the rapid development of the Internet and the ever-growing volume of information, the question arises of how to learn ontology instances and attributes automatically from large amounts of unstructured data. Most existing methods are rule-based or pattern-matching. Such methods are easy to understand and simple and fast to implement, but they lack flexibility and require too much manual involvement.

Summary of the Invention

[0005] To overcome the above deficiencies of the prior art, the present invention proposes a method for learning ontology instances quickly and effectively from large amounts of text data, forming structured information, expanding the set of ontology instances, and completing the transformation from unstructured data into machine-understandable structured information. The technical solution of the present invention is as follows:

[0006] A machine-learning-based ontology instance learning method for recognizing words in text that belong to ontology instances and classifying them, comprising the following steps:

[0007] (1) Document preprocessing: extract the body text as input to the subsequent steps;

[0008] (2) Text preprocessing: perform word segmentation and sentence splitting on the extracted body text to form a part-of-speech-tagged text set;

[0009] (3) Corpus annotation: manually annotate the part-of-speech-tagged text set, appending a type label after each word that belongs to an ontology instance, to form the annotated text, i.e., the corpus;

[0010] (4) Feature selection: select various features, including word features, part-of-speech features, and combination features of words and parts of speech, and convert the corpus and the text to be recognized into feature-vector form;

[0011] (5) Maximum entropy model training: build a maximum entropy model and estimate its parameters from the annotated corpus to obtain a maximum entropy classifier; [0012] (6) Instance extraction with the maximum entropy classifier: according to the selected features, process the preprocessed text into a form the classifier accepts, and use the trained maximum entropy classifier to recognize and classify instances word by word. For each recognized

[0013] ontology instance, select the category with the highest probability as the final result for the concept category to which it belongs, thereby accomplishing instance extraction.

[0014] With the method of the present invention, ontology instances can be learned quickly and effectively from large amounts of text. Because it is based on machine learning, knowledge is acquired automatically from training data, avoiding extensive manual linguistic study of natural-language text. The method can be switched between domains relatively easily and can ultimately serve ontology learning in many fields. Its performance can also be improved by enlarging the corpus, which suits the current trend of rapid web growth, makes full use of network data resources, and provides a solid data foundation for research and applications in the ontology field.

Brief Description of the Drawings

[0015] Fig. 1 is the overall flowchart of the present invention.

[0016] Fig. 2 is the model training flowchart. Hi in the figure denotes a classifier; the subclasses under the same classifier belong to the same parent class.

[0017] Fig. 3 is the flowchart of ontology instance learning based on maximum entropy.

Detailed Description

[0018] The present invention introduces machine learning into the learning process. Ontology models often contain many concept types and levels; machine learning methods can handle subtly different and fuzzy concepts, and can therefore effectively extract ontology instances and attributes from text.

[0019] Maximum entropy is a commonly used model in machine learning. Its main idea is to select, among the distributions satisfying the constraints, the one with maximum entropy. Classifiers based on this model are widely used in natural language processing, for problems such as named entity recognition and part-of-speech tagging. The principle of entity classification with a maximum entropy model is as follows: the context information of each entity is represented as (x1, x2, …, xm), and the categories the entity may belong to are represented as (y1, y2, …, yp). Then p(y|x) denotes the probability that the entity is classified as y given x. p(y|x) must satisfy the following conditions:

[0020]

    p(y | x) = (1 / Z(x)) · exp( Σ_i λ_i f_i(x, y) )

[0021]

    Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )

[0022] Here f_i denotes a feature, and λ_i is the parameter of each feature, expressing how much that feature contributes to the model; Z(x) is a normalization constant. During training, the model obtains the parameter values from the features of the training data. Given a new entity, the model outputs the probability that the entity belongs to each type. Depending on the situation, the researcher may choose the class with the highest probability as the final result, or take the top few results as candidates.
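The classification rule in [0020]-[0022] can be sketched in a few lines of stdlib-only Python; the feature strings, class labels, and weights below are purely illustrative and not taken from the patent:

```python
import math

def maxent_prob(features, weights, labels):
    """p(y|x) = exp(sum_i lambda_i * f_i(x,y)) / Z(x), for each label y.

    `weights` maps (feature, label) pairs to lambda values; a feature f_i
    fires (value 1) for a label when that pair is present in the dict.
    """
    scores = {}
    for y in labels:
        scores[y] = math.exp(sum(weights.get((f, y), 0.0) for f in features))
    z = sum(scores.values())  # normalization constant Z(x)
    return {y: s / z for y, s in scores.items()}

# Illustrative weights for two hypothetical ontology classes
weights = {("suffix=市", "ADMIN"): 1.2, ("suffix=江", "RIVER"): 1.5}
probs = maxent_prob(["suffix=市"], weights, ["ADMIN", "RIVER"])
best = max(probs, key=probs.get)  # take the most probable class, as in [0013]
```

As described in [0022], the model returns a full distribution over candidate classes, so one can also keep the top few classes as candidates rather than only the argmax.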

[0023] The present invention uses a maximum-entropy-based classifier, which can effectively and automatically learn ontology instances from large amounts of text. Referring to Fig. 1, and taking html documents as the starting point, the invention is described in more detail; it mainly comprises the following steps:

[0024] 1. Document preprocessing: mainly parsing the html document, removing html tags, and extracting the body text as input to the subsequent steps.

[0025] 2. Text preprocessing: performing word segmentation and sentence splitting on the extracted body text. Word segmentation is a fundamental step in Chinese natural language processing, and this method operates word by word; here the ICTCLAS segmentation platform developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used. Sentence boundaries must also be detected during preprocessing. Given the characteristics of Chinese text, a simple rule-based approach suffices: detecting common sentence-final punctuation such as "。", "？", and "！". The result is a part-of-speech-tagged text set. [0026] 3. Ontology instance learning based on maximum entropy. In this step, the maximum entropy classifier recognizes entities in a sentence and attaches classification labels to them, i.e., the ontology concept classes to which they belong. Learning ontology concept instances is similar to named entity recognition: instances belonging to a concept of the place-name ontology are extracted from the text. For example, in "北京市是中国的首都" ("Beijing is the capital of China"), both "北京" and "中国" are instances of the geographic-entity category. The concrete implementation mainly requires the following steps:
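The rule-based sentence boundary detection described above can be sketched as follows (a minimal illustration of splitting on the three punctuation marks named in the text; the patent's ICTCLAS word segmentation itself is not reproduced here):

```python
import re

def split_sentences(text):
    """Rule-based sentence splitting on common Chinese end punctuation.

    A zero-width lookbehind keeps the terminating mark attached to
    its sentence instead of discarding it.
    """
    parts = re.split(r'(?<=[。？！])', text)
    return [p for p in parts if p.strip()]

sents = split_sentences("北京市是中国的首都。石柱县位于重庆市。")
```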

[0027] (1) Corpus annotation. The maximum entropy model is a supervised learning method, and training it requires an annotated corpus. The annotation standard is determined by the target ontology model: instances in the corpus are identified according to the concept classes of the ontology. The corpus likewise comes from the web; documents are first preprocessed, and the extracted body text is then manually annotated according to the specified standard, i.e., a type label is appended after each ontology instance. The resulting annotated text is the required corpus.

[0028] (2) Feature selection. Features express the characteristics of different types of instances and are important indicators for classification and recognition. During processing, the corpus and the text to be recognized must be converted into feature-vector form. One advantage of the maximum entropy model is that only feature selection requires attention when it is used, which also means features must be chosen very carefully. The features selected by the present invention are as follows:

[0029] a. Word features: the current word and, with a window of 2, the first and second words to its left and right.

[0030] b. Part-of-speech features: the part of speech of the current word and, with a window of 2, the parts of speech of the first and second words to its left and right.

[0031] c. Combination features: pairwise combinations of the current word and its part of speech with the words and parts of speech of the first and second words to its left and right.

[0032] d. Other additional features: features reflecting the characteristics of a particular corpus, such as suffix-word features.
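The window-2 feature templates (a)-(c) above can be sketched as follows; the feature-string format, padding token, and the restriction of combinations to the immediate neighbours are illustrative choices, not specified by the patent:

```python
def extract_features(words, tags, i, window=2):
    """Window-2 word, POS, and combination features for position i (sketch)."""
    def word_at(j):
        return words[j] if 0 <= j < len(words) else "<PAD>"

    def tag_at(j):
        return tags[j] if 0 <= j < len(tags) else "<PAD>"

    feats = []
    for off in range(-window, window + 1):
        feats.append(f"w[{off}]={word_at(i + off)}")  # word feature (a)
        feats.append(f"t[{off}]={tag_at(i + off)}")   # part-of-speech feature (b)
    # combination features (c): pair the current word with its neighbours
    for off in (-1, 1):
        feats.append(f"w[0]|w[{off}]={words[i]}|{word_at(i + off)}")
        feats.append(f"t[0]|t[{off}]={tags[i]}|{tag_at(i + off)}")
    return feats

feats = extract_features(["北京市", "是", "首都"], ["ns", "v", "n"], 0)
```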

[0033] (3) Maximum entropy model training. In this step, the parameters of the maximum entropy model are trained on the corpus annotated in step (1), finally yielding the classifier. Considering that ontology models contain many concepts with fine-grained classification, the parent-child relations between ontology concepts are fully exploited during training: one classifier is trained for the subclasses under the same parent class. This prevents a single classifier from bearing too heavy a classification burden and makes good use of the hierarchical structure of the ontology model. The training process is shown in Fig. 2.
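The one-classifier-per-parent scheme of Fig. 2 amounts to partitioning the training data by parent concept before training; a sketch under a hypothetical hierarchy and hypothetical labels (none of these names come from the patent):

```python
from collections import defaultdict

# Hypothetical ontology hierarchy: parent concept -> child concepts
hierarchy = {
    "GeographicEntity": ["HumanGeographic", "NaturalGeographic"],
    "HumanGeographic": ["AdminRegion", "Facility"],
}

def partition_training_data(samples, hierarchy):
    """Group labelled samples so one classifier is trained per parent concept.

    Each classifier H_i then only discriminates among the children of one
    parent, mirroring Fig. 2. `samples` is a list of (features, label) pairs.
    """
    parent_of = {c: p for p, children in hierarchy.items() for c in children}
    buckets = defaultdict(list)
    for feats, label in samples:
        buckets[parent_of[label]].append((feats, label))
    return buckets  # train one maximum entropy classifier per bucket

buckets = partition_training_data(
    [(["f1"], "AdminRegion"), (["f2"], "NaturalGeographic")], hierarchy)
```

At prediction time, classifiers are applied top-down: the root classifier picks a child, whose own classifier refines the decision.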

[0034] (4) Instance extraction with the maximum entropy classifier. According to the selected features, the preprocessed text is processed into a form the classifier accepts, and the maximum entropy classifier trained in step (3) recognizes and classifies instances word by word. The classifier outputs, for the current word, a set of probability values over the candidate categories. For words that do not belong to any ontology instance, all probability values should be zero. For each recognized ontology instance, the category with the highest probability is selected as the final result for the concept category to which it belongs.

[0035] 4. Mapping ontology instances to concepts. In the previous step, the classifier was used to extract the instances in the text. In this step, the instances are mapped to the corresponding concepts and saved in OWL form. Fig. 3 shows the flowchart of ontology instance learning based on maximum entropy.

[0036] With the method of the present invention, ontology instances can be learned for a target ontology from a given text collection. Here, a Chinese place-name ontology model is used to illustrate how to learn ontology instances with this method. The corpus consists of 500 articles from the Chinese-administrative-division pages of the Chinese Wikipedia, of which 400 serve as the training corpus and 100 as the test corpus. Based on the characteristics of the geographic information in Wikipedia's Chinese administrative divisions, the geographic-entity category is chosen as the focus of recognition and classification. The concept "geographic entity" contains categories such as human geographic entities and natural geographic entities, which in turn contain many different subcategories, making it the focus of classification. Moreover, most place-name relations hold between geographic entities, so they serve as the focus for evaluating the machine learning method. The corpus contains instances of 50 types of geographic entities; partial per-type statistics are as follows:

[0037] Table 1. Human geographic entity statistics

[0038] Table 1 categories: administrative regions; memorial sites of toponymic significance; other buildings of toponymic significance; facilities of toponymic significance. (The counts of Table 1 appear only as image CN102662923AD00061 in the original.)

[0040] Table 2. Natural geographic entity statistics. (The counts of Table 2 appear only as image CN102662923AD00062 in the original.)

[0042] 1. First, the 500 documents are preprocessed.

[0043] 2. 400 articles are selected as the corpus and annotated. The corpus is annotated as follows:

[0044] [石柱土家族自治县]DLST-RWDLST-XZQY-SJXZQY-EJXZQYZZX(旧称"石桩县",1959年称"石柱县"[1])位于[中华人民共和国]DLST-RWDLST-XZQY-GJ[重庆市]

[0045] DLST-RWDLST-XZQY-YJXZQY-YJXZQYCSX 中东部,西临[长江]DLST-ZRDLST-SX-HL,东与[湖北省]DLST-RWDLST-XZQY-YJXZQY-YJXZQYPTX 相邻,距离重庆321公里,是重庆下辖的四个少数民族自治县中距离重庆最近的一个。辖12个镇、20个乡。[黄水]DLST-ZRDLST-SX-HL[国家森林公园]DLST-RWDLST-JNLYD-GY[万寿寨]DLST-RWDLST-JNLYD-FJMSQ,准丹霞地貌景观[洗脚溪]DLST-RWDLST-JNLYD-FJMSQ[西沱云梯街]DLST-RWDLST-JNLYD-FJMSQ

(In this annotated sample, each bracketed span is an ontology instance, immediately followed by its type label.)

[0046] 3. Based on the characteristics of place names, suffix-word features are selectively added on top of the basic features:

[0047] (1) Human-entity suffix feature: whether the current word contains a word from the suffix lexicon, such as "省" (province), "市" (city), or "县" (county). If the current word contains a suffix-lexicon word, the feature value is set to 1; otherwise it is 0.

[0048] (2) Natural-entity suffix feature: whether the current word contains a word from the suffix lexicon, such as "山" (mountain), "河" (river), "湖" (lake), or "海" (sea). If the current word contains a suffix-lexicon word, the feature value is set to 1; otherwise it is 0.
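The two binary suffix features can be sketched directly; the sample lexicons below contain only the suffixes named in the text above, whereas the patent's actual lexicons are presumably larger:

```python
HUMAN_SUFFIXES = ("省", "市", "县")          # human-entity suffix lexicon (sample)
NATURAL_SUFFIXES = ("山", "河", "湖", "海")  # natural-entity suffix lexicon (sample)

def suffix_features(word):
    """Binary suffix features: 1 if the word contains a lexicon suffix, else 0."""
    human = int(any(s in word for s in HUMAN_SUFFIXES))
    natural = int(any(s in word for s in NATURAL_SUFFIXES))
    return {"human_suffix": human, "natural_suffix": natural}

f = suffix_features("重庆市")
```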

[0049] 4. Model training stage. Initially a single classifier is used, distinguishing only two categories, the two top-level classes of the ontology concepts: human geographic entities and natural geographic entities.

[0050] 5. Model testing. The remaining 100 documents are segmented into words and sentences, processed into the input form required by the machine learning algorithm according to the features, and fed to the classifier for classification. The table below shows the classification performance: [0051] Table 4. Two-class classification results

[0052] (The results of Table 4 appear only as image CN102662923AD00071 in the original.)

[0053] The present invention achieves satisfactory results on this corpus and, compared with rule-based methods, is easier to port to new domains. As the corpus grows, precision and recall can also be improved effectively. The method can support web-scale open-domain applications.

Claims (1)

1. A machine-learning-based ontology instance learning method for recognizing words in text that belong to ontology instances and classifying them, comprising the following steps: (1) document preprocessing: extracting the body text as input to the subsequent steps; (2) text preprocessing: performing word segmentation and sentence splitting on the extracted body text to form a part-of-speech-tagged text set; (3) corpus annotation: manually annotating the part-of-speech-tagged text set, appending a type label after each word that belongs to an ontology instance, to form the annotated text, i.e., the corpus; (4) feature selection: selecting various features, including word features, part-of-speech features, and combination features of words and parts of speech, and converting the corpus and the text to be recognized into feature-vector form; (5) maximum entropy model training: building a maximum entropy model and estimating its parameters from the annotated corpus to obtain a maximum entropy classifier; (6) instance extraction with the maximum entropy classifier: according to the selected features, processing the preprocessed text into a form the classifier accepts, using the trained maximum entropy classifier to recognize and classify instances word by word, and, for each recognized ontology instance, selecting the category with the highest probability as the final result for the concept category to which it belongs, thereby accomplishing instance extraction.
CN2012101218391A 2012-04-23 2012-04-23 Entity instance leading method based on machine learning CN102662923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101218391A CN102662923A (en) 2012-04-23 2012-04-23 Entity instance leading method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012101218391A CN102662923A (en) 2012-04-23 2012-04-23 Entity instance leading method based on machine learning

Publications (1)

Publication Number Publication Date
CN102662923A true CN102662923A (en) 2012-09-12

Family

ID=46772418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101218391A CN102662923A (en) 2012-04-23 2012-04-23 Entity instance leading method based on machine learning

Country Status (1)

Country Link
CN (1) CN102662923A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0816611A (en) * 1994-06-27 1996-01-19 Sharp Corp Data retrieving device using natural language
US6212532B1 (en) * 1998-10-22 2001-04-03 International Business Machines Corporation Text categorization toolkit
CN101310274A (en) * 2005-11-14 2008-11-19 马克森斯公司 A knowledge correlation search engine


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李茹等: "基于汉语框架网的中文问题分类" ("Chinese question classification based on Chinese FrameNet"), 《计算机工程与应用》 (Computer Engineering and Applications) *
赵文娟: "基于汉语框架本体的网络资源标注" ("Web resource annotation based on the Chinese frame ontology"), 《中国优秀硕士学位论文全文数据库》 (China Master's Theses Full-text Database) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530282A (en) * 2013-10-23 2014-01-22 北京紫冬锐意语音科技有限公司 Corpus tagging method and equipment
CN103530282B (en) * 2013-10-23 2016-07-13 北京紫冬锐意语音科技有限公司 Corpus labeling method and equipment
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN103617290B (en) * 2013-12-13 2017-02-15 江苏名通信息科技有限公司 Chinese machine-reading system
CN103617290A (en) * 2013-12-13 2014-03-05 江苏名通信息科技有限公司 Chinese machine-reading system
CN103678281B (en) * 2013-12-31 2016-10-19 北京百度网讯科技有限公司 The method and apparatus that text is carried out automatic marking
CN103678281A (en) * 2013-12-31 2014-03-26 北京百度网讯科技有限公司 Method and device for automatically labeling text
CN105830060A (en) * 2014-02-06 2016-08-03 富士施乐株式会社 Information processing device, information processing program, storage medium, and information processing method
CN104346327A (en) * 2014-10-23 2015-02-11 苏州大学 Method and device for determining emotion complexity of texts
CN104391902A (en) * 2014-11-12 2015-03-04 清华大学 Maximum entropy topic model-based online document classification method and device
CN105718256A (en) * 2014-12-18 2016-06-29 通用汽车环球科技运作有限责任公司 Methodology and apparatus for consistency check by comparison of ontology models
CN105701084A (en) * 2015-12-28 2016-06-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Characteristic extraction method of text classification on the basis of mutual information
CN105654144A (en) * 2016-02-29 2016-06-08 东南大学 Social network body constructing method based on machine learning
CN105654144B (en) * 2016-02-29 2019-01-29 东南大学 A kind of social network ontologies construction method based on machine learning
CN106570002A (en) * 2016-11-07 2017-04-19 网易(杭州)网络有限公司 Natural language processing method and device
CN107622126A (en) * 2017-09-28 2018-01-23 联想(北京)有限公司 The method and apparatus sorted out to the solid data in data acquisition system


Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)