CN102662923A

CN102662923A - Ontology instance learning method based on machine learning

Info

Publication number: CN102662923A
Application number: CN2012101218391A
Authority: CN
Inventors: 张萌; 王文俊
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2012-04-23
Filing date: 2012-04-23
Publication date: 2012-09-12

Abstract

The invention belongs to the technical field of natural language processing and ontology learning, and relates to an ontology instance learning method based on machine learning. The method comprises the following steps: annotating a corpus after document preprocessing; selecting features including word features, part-of-speech features, and combined word and part-of-speech features, and converting the corpus and the text to be recognized into feature vectors; training a maximum entropy model, using the annotated corpus to estimate the model parameters and obtain a maximum entropy classifier; and extracting instances with the maximum entropy classifier. The method has the advantage that ontology instances can be learned quickly and effectively from a large volume of text.

Description

An ontology instance learning method based on machine learning

Technical field

The present invention relates to the technical fields of natural language processing and ontology learning. Specifically, it learns ontology instances according to the characteristics of the ontology model, drawing on the methods and experience of machine-learning-based text processing in natural language processing.

Background

At present, most information on the traditional web is unstructured, poorly organized, and mixed with a large amount of useless, redundant information. The explosive growth of Internet information makes processing information and acquiring knowledge even more difficult. An ontology is an explicit, formal specification of a shared conceptual model; in the Semantic Web it plays the role of enabling communication and mutual understanding between service layers. Ontologies provide the knowledge base and rule base on which the Semantic Web is built, and semantic search and intelligent processing can be carried out on that basis. Much current research focuses on how to construct the concepts and relations of an ontology. For an ontology whose initial construction is complete, however, and particularly when it is to be applied in a system with strong usability requirements, how to extract ontology instances from large amounts of unstructured data is also a question worth considering. On the one hand, a complete ontology should contain instances of its concepts and relations; on the other hand, continually acquiring new instances helps refine the ontology model and lets the ontology evolve in a better direction.

At present, research on generating ontology instances falls into two main categories. The first takes ontology instance generation as its primary goal; most of these methods are built around pattern matching. In the second, ontology instances and attributes are usually learned by applying ontology-based information extraction techniques. Here, ontology instances are a by-product of the research: for example, in an ontology-based extraction system, researchers exploit the characteristics of the ontology during information extraction to improve extraction efficiency and precision, and the process ultimately produces a large number of ontology instances. Many ontology-based extraction systems adopt the GATE framework and introduce an already constructed ontology to perform named entity recognition.

With the rapid development of the Internet, the volume of information grows ever larger, which raises the problem of how to learn ontology instances and attributes automatically from large amounts of unstructured data. Most existing methods are rule-based or based on pattern matching. Such methods are easy to understand and simple and fast to implement, but they are inflexible and require too much manual involvement.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention proposes a method for learning ontology instances quickly and efficiently from large amounts of text data, forming structured information, expanding the set of ontology instances, and converting unstructured data into machine-understandable structured information. The technical solution of the present invention is as follows:

An ontology instance learning method based on machine learning, used to identify the words in a text that belong to ontology instances and to classify them, comprising the following steps:

(1) Document preprocessing: extract the main body text as input to the subsequent steps;

(2) Text preprocessing: perform word segmentation and sentence splitting on the extracted text, forming a part-of-speech-tagged text set;

(3) Corpus annotation: manually annotate the part-of-speech-tagged text set, appending a type label after each word that belongs to an ontology instance, to form the annotated text, i.e. the corpus;

(4) Feature selection: select features including word features, part-of-speech features, and combined word and part-of-speech features, and convert the corpus and the text to be recognized into feature vectors;

(5) Maximum entropy model training: build a maximum entropy model and train its parameters on the annotated corpus to obtain a maximum entropy classifier;

(6) Instance extraction with the maximum entropy classifier: according to the selected features, convert the preprocessed text into a form the classifier accepts; use the trained maximum entropy classifier to recognize and classify instances word by word; for each recognized ontology instance, select the class with the highest probability as the final result for the concept it belongs to, thereby completing instance extraction.

With the method of the present invention, ontology instances can be learned quickly and efficiently from large amounts of text. Because the method is based on machine learning, it acquires knowledge automatically from training data and thus avoids extensive manual linguistic study of natural-language text. It can be switched between domains more easily and can ultimately serve ontology learning in many domains. In addition, performance can be improved by expanding the training corpus, which suits the rapid growth of the web, makes full use of web data resources, and provides a solid data foundation for research and applications in the ontology field.

Description of drawings

Fig. 1 is the overall flowchart of the present invention.

Fig. 2 is the model training flowchart. In the figure, H_i denotes a classifier; the subclasses under the same classifier belong to the same parent class.

Fig. 3 is the flowchart of ontology instance learning based on maximum entropy.

Embodiment

The present invention introduces machine learning into the learning process. An ontology model often has many concept types and levels; machine learning methods can handle subtle distinctions and fuzzy concepts, and can therefore extract ontology instances and attributes from text effectively.

Maximum entropy is a common model in machine learning. Its main idea is to choose, among the distributions that satisfy the constraints, the one with maximum entropy. Classifiers built on this model are widely used in natural language processing, for problems such as named entity recognition and part-of-speech tagging. The principle of entity classification with a maximum entropy model is as follows: the contextual information of each entity is represented as x = (x_1, x_2, ..., x_m), and the candidate classes of the entity are (y_1, y_2, ..., y_p). Then p(y | x) denotes the probability that the entity is assigned to class y given x. p(y | x) should satisfy the following conditions:

p(y \mid x) = Z_{\lambda}(x) \exp\left( \sum_{i} \lambda_{i} f_{i}(x, y) \right)

Z_{\lambda}(x) = \frac{1}{\sum_{y} \exp\left[ \sum_{i} \lambda_{i} f_{i}(x, y) \right]}

Here f_i denotes a feature, and λ_i is the parameter of that feature, representing how much the feature contributes to the model. Z_λ(x) is the normalization factor. During training, the model estimates the parameter values from the features of the training data. Given a new entity, the model outputs the probability that the entity belongs to each class. Depending on the situation, the researcher can take the class with the highest probability as the final result, or keep the top-ranked results as candidates.
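As an illustration of the two formulas above, the following minimal Python sketch evaluates p(y | x) for given weights and feature functions. The two feature functions, their weights, and the class names RWDLST and ZRDLST (borrowed from the label codes in the sample corpus) are hypothetical toy values, not parameters from the invention.

```python
import math

def maxent_probabilities(weights, features, x, labels):
    """Return p(y | x) for every y in labels.

    weights  -- list of lambda_i
    features -- list of feature functions f_i(x, y), aligned with weights
    """
    scores = {y: sum(w * f(x, y) for w, f in zip(weights, features)) for y in labels}
    # Subtracting the largest score avoids overflow; the shift cancels on division.
    shift = max(scores.values())
    exps = {y: math.exp(s - shift) for y, s in scores.items()}
    total = sum(exps.values())  # plays the role of 1 / Z_lambda(x)
    return {y: e / total for y, e in exps.items()}

# Hypothetical toy features: each fires for one suffix and one class.
def f_county(x, y):
    return 1.0 if x["word"].endswith("County") and y == "RWDLST" else 0.0

def f_river(x, y):
    return 1.0 if x["word"].endswith("River") and y == "ZRDLST" else 0.0

probs = maxent_probabilities(
    [2.0, 1.5], [f_county, f_river], {"word": "Shizhu County"}, ["RWDLST", "ZRDLST"]
)
```

In this toy case only the first feature fires, so the RWDLST class receives the larger probability and the two probabilities sum to 1.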

The present invention uses a classifier based on maximum entropy and can learn ontology instances from large amounts of text effectively and automatically. Referring to Fig. 1 and taking HTML documents as the starting point, the invention is described in more detail below. It mainly comprises the following steps:

1. Document preprocessing: parse the HTML document, remove the HTML tags, and extract the main body text as input to the subsequent steps.

2. Text preprocessing: perform word segmentation and sentence splitting on the extracted text. Word segmentation is a very basic step in Chinese natural language processing tasks. This method operates on words, and here the ICTCLAS platform developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used for segmentation. Sentence boundaries must also be detected during preprocessing. Given the characteristics of Chinese text, a simple rule-based method suffices: detect common sentence-final punctuation such as "。", "?", and "!". The result is a part-of-speech-tagged text set.
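Steps 1 and 2 can be sketched with the Python standard library as follows. This is only an illustrative sketch: the helper names are hypothetical, skipping script and style content is an added assumption, and word segmentation with part-of-speech tagging (done by the segmentation platform in the text) is not reproduced here.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_html(html):
    """Remove tags and return the concatenated text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

def split_sentences(text):
    """Rule-based splitting on the sentence-final marks 。?!"""
    return [s.strip() for s in re.findall(r"[^。?!]+[。?!]?", text) if s.strip()]

page = "<html><body><p>北京是<b>中国</b>的首都。</p><script>var a=1;</script><p>你好!</p></body></html>"
text = strip_html(page)
sentences = split_sentences(text)
```

Each sentence keeps its closing punctuation mark, so the output can be passed directly to a segmenter that expects whole sentences.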

3. Ontology instance learning based on maximum entropy. In this step, the maximum entropy classifier identifies the entities in a sentence and attaches a class label to each, i.e. the ontology concept class it belongs to. Learning instances of ontology concepts is similar to named entity recognition, except that the instances to be extracted from the text are those belonging to a concept of the place-name ontology. For example, in "Beijing is the capital of China", both "Beijing" and "China" are instances of the class "geographic entity". The implementation mainly involves the following steps:

(1) Annotation of the training corpus. The maximum entropy model is a supervised learning method, so training the model requires an annotated corpus. The annotation standard is determined by the target ontology model: instances in the corpus are labeled according to the concept classes in the ontology. The corpus likewise comes from the web. Document preprocessing is performed first, and the extracted text is then annotated manually according to the specified standard, i.e. a type label is appended after each ontology instance. The resulting annotated text is the required corpus.

(2) Feature selection. Features express the characteristics of different types of instances and are the key indicators for classification and recognition. For processing, the corpus and the text to be recognized must be converted into feature vectors. One advantage of the maximum entropy model is that the user only needs to attend to the choice of features, which in turn requires that features be chosen with great care. The features selected in the present invention are:

A. Word features: the current word and, with a window size of 2, the first and second words to its left and right.

B. Part-of-speech features: the part of speech of the current word and, with a window size of 2, the parts of speech of the first and second words to its left and right.

C. Combined features: pairwise combinations of the words and parts of speech of the current word and of the first and second words to its left and right.

D. Other additional features: features that reflect the characteristics of a particular corpus, such as suffix-word features.
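Features A to C can be sketched as a function that maps one position in a tagged sentence to a sparse feature dictionary. This is one possible reading: the combined feature here pairs each word with its own part of speech, the padding symbol for positions outside the sentence is an assumption, and the part-of-speech tags in the example are illustrative only.

```python
def extract_features(tokens, i, window=2):
    """tokens: list of (word, pos) pairs for one sentence; i: index of the current word."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence get a padding symbol (an assumption).
        word, pos = tokens[j] if 0 <= j < len(tokens) else ("<PAD>", "<PAD>")
        feats[f"w[{offset}]={word}"] = 1          # A. word feature
        feats[f"p[{offset}]={pos}"] = 1           # B. part-of-speech feature
        feats[f"wp[{offset}]={word}/{pos}"] = 1   # C. combined word and part-of-speech feature
    return feats

# "Beijing is the capital of China", segmented and tagged with illustrative tags.
sentence = [("北京", "ns"), ("是", "v"), ("中国", "ns"), ("的", "u"), ("首都", "n")]
features = extract_features(sentence, 0)
```

With a window of 2 there are five positions and three feature types per position, so each word yields 15 binary features before any corpus-specific features (item D) are added.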

(3) Maximum entropy model training. In this step, the parameters of the maximum entropy model are trained on the corpus annotated in step (1), yielding the classifier. Because an ontology model has many concepts and fine-grained classes, training makes full use of the child-parent relations between ontology concepts: one classifier is trained for the subclasses under the same parent class. This keeps any single classifier from bearing too heavy a classification burden and makes good use of the hierarchical structure of the ontology model. The training process is shown in Fig. 2.
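The per-parent training scheme implies that the annotated examples are partitioned by parent concept before training. The sketch below shows only that partitioning step, not the training itself; the concept names and the `parent_of` mapping are hypothetical.

```python
from collections import defaultdict

def group_by_parent(examples, parent_of):
    """Partition (features, concept) examples into one training set per parent concept."""
    groups = defaultdict(list)
    for features, concept in examples:
        groups[parent_of[concept]].append((features, concept))
    return dict(groups)

# Hypothetical fragment of a concept hierarchy: child concept -> parent concept.
parent_of = {
    "County": "AdministrativeRegion",
    "Province": "AdministrativeRegion",
    "River": "WaterSystem",
}
examples = [
    ({"w[0]=石柱县": 1}, "County"),
    ({"w[0]=湖北省": 1}, "Province"),
    ({"w[0]=长江": 1}, "River"),
]
training_sets = group_by_parent(examples, parent_of)
```

Each value of `training_sets` would then be used to train one classifier H_i that only has to separate sibling concepts.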

(4) Instance extraction with the maximum entropy classifier. According to the selected features, the preprocessed text is converted into a form the classifier accepts, and the maximum entropy classifier trained in step (3) recognizes and classifies instances word by word. The classifier outputs a set of probabilities that the current word belongs to each candidate class. For a word that does not belong to any ontology instance, these probabilities should be zero. For each recognized ontology instance, the class with the highest probability is selected as the final result for the concept it belongs to.
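The decision rule in step (4) can be sketched as an argmax with a rejection case. The `threshold` parameter is an assumption added for illustration; with the default of 0.0 a word is rejected only when every class probability is zero, as the text describes.

```python
def classify_word(probs, threshold=0.0):
    """probs: dict mapping class name to probability.

    Returns the most probable class, or None when no class exceeds the threshold,
    i.e. the word is treated as not belonging to any ontology instance.
    """
    if not probs:
        return None
    best = max(probs, key=probs.get)
    return best if probs[best] > threshold else None

instance_class = classify_word({"County": 0.7, "Province": 0.2, "River": 0.1})
non_instance = classify_word({"County": 0.0, "Province": 0.0, "River": 0.0})
```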

4. Mapping ontology instances to concepts. In the previous step, the classifier extracted the instances in the text. In this step, these instances are mapped to the corresponding concepts and saved in OWL format. Fig. 3 is the flowchart of ontology instance learning based on maximum entropy.
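One way to save the mapped instances in OWL is to write each instance as an owl:NamedIndividual typed by its concept, in RDF/XML. The sketch below uses only the Python standard library; the base IRI and the function name are hypothetical, and the text does not specify which OWL serialization is used.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
BASE = "http://example.org/placename#"  # hypothetical ontology IRI

def instances_to_owl(instances):
    """instances: iterable of (instance_name, concept_name) pairs."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("owl", OWL)
    root = ET.Element(f"{{{RDF}}}RDF")
    for name, concept in instances:
        # One named individual per instance, typed by its ontology concept.
        individual = ET.SubElement(
            root, f"{{{OWL}}}NamedIndividual", {f"{{{RDF}}}about": BASE + name}
        )
        ET.SubElement(individual, f"{{{RDF}}}type", {f"{{{RDF}}}resource": BASE + concept})
    return ET.tostring(root, encoding="unicode")

owl_xml = instances_to_owl([("Beijing", "GeographicEntity"), ("China", "GeographicEntity")])
```

Instance names are used directly as IRI fragments here; names containing spaces or other reserved characters would need escaping first.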

With the method of the present invention, the corresponding ontology instances can be learned from a given text collection for a target ontology. Here, a Chinese place-name ontology model is used as an example to show how the method learns ontology instances. The corpus comes from the pages on the administrative divisions of China in the Chinese edition of Wikipedia, 500 articles in total, of which 400 are used as the training corpus and 100 as the test corpus. Given the geographic information in these pages, the class "geographic entity" is chosen as the focus of recognition and classification. The concept "geographic entity" includes classes such as human geographic entity and physical geographic entity, each of which in turn includes many different classes; these are the focus of classification. Moreover, most place-name relations hold between geographic entities, so geographic entities are treated as the focus for applying the machine learning method. The training corpus contains instances of 50 classes of geographic entities. Statistics for some of these classes are as follows:

Table 1. Statistics of human geographic entities

Table 2. Statistics of physical geographic entities

1. First, preprocess the 500 documents.

2. Select 400 documents and annotate them as the corpus. A sample of the annotated corpus follows:

[Shizhu County] DLST-RWDLST-XZQY-SJXZQY-EJXZQYZZX (formerly known as "Stone Zhu County"; renamed "Shizhu" in 1959 [1]) lies in the east-central part of [China] DLST-RWDLST-XZQY-GJ [Chongqing] DLST-RWDLST-XZQY-YJXZQY-YJXZQYCSX; it borders the [Yangtze River] DLST-ZRDLST-SX-HL to the west and [Hubei Province] DLST-RWDLST-XZQY-YJXZQY-YJXZQYPTX to the east. It is 321 kilometers from Chongqing and is the nearest to Chongqing of the four ethnic minority autonomous counties under Chongqing's jurisdiction. It administers 12 towns and 20 townships. [Huangshui] DLST-ZRDLST-SX-HL [National Forest Park] DLST-RWDLST-JNLYD-GY, [Wanshouzhai] DLST-RWDLST-JNLYD-FJMSQ, quasi-"Danxia" landform landscape, [foot bath small stream] DLST-RWDLST-JNLYD-FJMSQ, [Xituo Yunti Street] DLST-RWDLST-JNLYD-FJMSQ
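The annotation format in the sample, a bracketed instance followed by a hyphen-separated label path, can be read back with a regular expression. This is a sketch under the assumption that every label consists of at least two uppercase segments joined by hyphens, as in the sample; the function name is hypothetical.

```python
import re

# A bracketed span followed by a label path such as DLST-ZRDLST-SX-HL.
# Requiring at least one hyphen keeps citations like "[1]" from matching.
ANNOTATION = re.compile(r"\[([^\[\]]+)\]\s*([A-Z]+(?:-[A-Z]+)+)")

def parse_annotations(text):
    """Return a list of (instance, label_path) pairs found in annotated text."""
    return [(m.group(1).strip(), m.group(2).split("-")) for m in ANNOTATION.finditer(text)]

sample = (
    '[Shizhu County] DLST-RWDLST-XZQY-SJXZQY-EJXZQYZZX (called "Shizhu" [1]) in '
    "[China] DLST-RWDLST-XZQY-GJ, west of the [Yangtze River] DLST-ZRDLST-SX-HL"
)
annotations = parse_annotations(sample)
```

Each label path runs from the most general concept to the most specific one, so it can be used both as the training target and to recover the parent of each class.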

3. Given the characteristics of place names, suffix-word features are optionally added on top of the basic features:

(1) Human-geography entity suffix feature: whether the current word contains a word from the suffix dictionary, such as "province", "city", or "county". If it does, the feature value is set to 1; otherwise it is 0.

(2) Natural entity suffix feature: whether the current word contains a word from the suffix dictionary, such as "mountain", "river", "lake", or "sea". If it does, the feature value is set to 1; otherwise it is 0.
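The two suffix features can be sketched as dictionary lookups. The Chinese characters in the two small dictionaries below are assumptions back-translated from the English examples in the text, and the real suffix dictionaries are presumably larger.

```python
# Assumed suffix dictionaries, back-translated from the examples in the text.
HUMAN_SUFFIXES = ("省", "市", "县")          # province, city, county
NATURAL_SUFFIXES = ("山", "河", "湖", "海")  # mountain, river, lake, sea

def suffix_features(word):
    """Binary features: 1 if the word contains any entry of the dictionary, else 0."""
    return {
        "human_suffix": 1 if any(s in word for s in HUMAN_SUFFIXES) else 0,
        "natural_suffix": 1 if any(s in word for s in NATURAL_SUFFIXES) else 0,
    }

city = suffix_features("重庆市")
river = suffix_features("黄河")
province = suffix_features("湖北省")
```

Because the test is containment rather than a strict suffix match, a word such as 湖北省 (Hubei Province) triggers both features; the classifier weights, not the features alone, resolve such cases.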

4. Model training stage. Initially a single classifier is used, distinguishing only two classes, namely the two top-level classes of the ontology concepts: human geographic entity and physical geographic entity.
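For the two-class experiment, each full label path has to be reduced to its top-level class. The sketch below assumes that the second segment of a label path identifies that class, with the codes RWDLST and ZRDLST taken from the sample annotation; this reading of the codes is an inference, not something the text states.

```python
TOP_LEVEL_CODES = {"RWDLST", "ZRDLST"}  # assumed codes of the two top-level classes

def top_level_class(label):
    """Reduce a label path such as 'DLST-ZRDLST-SX-HL' to its top-level class code."""
    parts = label.split("-")
    if len(parts) > 1 and parts[1] in TOP_LEVEL_CODES:
        return parts[1]
    return None

county_class = top_level_class("DLST-RWDLST-XZQY-GJ")
river_class = top_level_class("DLST-ZRDLST-SX-HL")
```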

5. Model testing. The remaining 100 documents are segmented and split into sentences, converted according to the features into the input format required by the machine learning algorithm, and fed to the classifier for classification. The classification performance is shown in the following table:

Table 4. Two-class classification performance

The present invention achieves satisfactory results on this corpus and, compared with rule-based methods, is easier to migrate to new domains. As the corpus grows, precision and recall can also be improved effectively. The method can handle web-scale, open-domain applications.
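Precision and recall for one class can be computed from gold and predicted labels as sketched below. The label sequences are made-up toy data for illustration and are not results from the experiment; the F1 score is included as a common companion measure, although the text mentions only precision and recall.

```python
def precision_recall_f1(gold, predicted, target):
    """Per-class precision, recall, and F1 over aligned gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels; None marks a word that is not an ontology instance.
gold = ["RWDLST", "RWDLST", "ZRDLST", "ZRDLST", None]
predicted = ["RWDLST", "ZRDLST", "ZRDLST", "ZRDLST", None]
p_nat, r_nat, f_nat = precision_recall_f1(gold, predicted, "ZRDLST")
p_hum, r_hum, f_hum = precision_recall_f1(gold, predicted, "RWDLST")
```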

Claims

1. An ontology instance learning method based on machine learning, used to identify the words in a text that belong to ontology instances and to classify them, comprising the following steps:

(1) Document preprocessing: extract the main body text as input to the subsequent steps;

(5) Maximum entropy model training: build a maximum entropy model and train its parameters on the annotated corpus to obtain a maximum entropy classifier;

(6) Instance extraction with the maximum entropy classifier: according to the selected features, convert the preprocessed text into a form the classifier accepts; use the trained maximum entropy classifier to recognize and classify instances word by word; for each recognized ontology instance, select the class with the highest probability as the final result for the concept it belongs to, thereby completing instance extraction.