CN101404033A

CN101404033A - Automatic generation method and system for an ontology hierarchical structure

Info

Publication number: CN101404033A
Application number: CNA2008102263909A
Authority: CN
Inventors: 穗志方; 赵庆亮
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2008-11-14
Filing date: 2008-11-14
Publication date: 2009-04-08

Abstract

The invention relates to an automatic generation method of an ontology hierarchical structure. The method comprises the following steps: S1. an attribute value list of each concept is extracted based on the Internet; S2. similar attribute values in the attribute value list are merged; S3. the attribute values in the attribute value list are filtered according to the domain feature of the concept; and S4. the conceptual hierarchical structure in the ontology is automatically generated by the merged and filtered attribute values. The invention also relates to a corresponding system. The method takes weighted attribute values as the characteristic vectors of the concepts and utilizes the clustering algorithm to cluster the concepts, which can greatly improve the accuracy rate of the results, thus causing the automatic generation of a large-scale and practical Ontology to be possible.

Description

Automatic generation method and system for an ontology hierarchical structure

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and system for automatically generating an ontology hierarchical structure.

Background technology

An ontology is a semantic basis for exchange (dialogue, interoperation, sharing, etc.) between different agents (people, machines, software systems, etc.) within a given domain (which may be a specific application or a broader scope). Early ontologies were built manually, which consumed a great deal of manpower, material and financial resources and took a long time, and this greatly limited the application of ontologies. Over roughly the past 30 years, researchers have concentrated on automatic and semi-automatic ontology construction and have obtained many results. An important part of automatic ontology construction is the automatic generation of the concept hierarchy: the concept hierarchy is the basic framework with which an ontology organizes knowledge, and it is the core of automatic ontology construction. An efficient and accurate algorithm for automatically generating the concept hierarchy is therefore fundamental to automatically building large-scale, practical ontologies, and it also helps make the Semantic Web, the ontology-based next-generation Internet, possible.

Current methods for automatically generating an ontology hierarchy are mainly pattern-based, FCA-based (formal concept analysis) or clustering-based. These methods suffer from low accuracy, which has become a bottleneck in automatic ontology construction.

Summary of the invention

The object of the invention is to overcome the low accuracy of prior-art methods for automatically generating an ontology hierarchical structure.

To achieve this object, the technical scheme of the invention proposes a method for automatically generating an ontology hierarchical structure, comprising the following steps:

S1. extracting an attribute value list of each concept from the Internet;

S2. merging similar attribute values in the attribute value list;

S3. filtering the attribute values in the attribute value list according to the domain characteristics of the concept;

S4. automatically generating the concept hierarchy of the ontology from the merged and filtered attribute values.

In the above method, step S1 specifically comprises:

S111. searching the Internet using "class name + attribute + seed set" as the keywords, and saving the result web pages;

S112. denoising the result web pages;

S113. selecting sentences from the denoised web pages according to preset conditions;

S114. extracting the phrases that are in a parallel structure with the seeds, and calculating their weights;

S115. determining whether a weight obtained in step S114 is higher than a preset threshold; if so, adding the phrase to the attribute value list as a new attribute value and returning to step S113; otherwise stopping.

In the above method, the preset conditions of step S113 comprise:

the sentence contains a parallel structure;

a seed attribute value appears in the parallel structure.

S121. reading the sentences produced by step S113, and replacing the parallel structure named in the preset conditions with an empty slot;

S122. extracting the features of the sentences from step S121 according to a preset feature template;

Relatively simple feature templates are adopted here so that the feature space does not become too sparse.

Here i ranges from 0 to length-1, where length is the length of the sentence; word_{i} denotes each word in the sentence; word_{i-1} + word_{i} denotes the bigram formed by the previous word and the current word; pos_{i} denotes the part of speech of each word in the sentence; combinations of parts of speech are read in the same way as combinations of words.

S123. training a classifier with a maximum entropy tool on the sentence features generated in step S122;

S124. searching the Internet using "class name + attribute" as the keywords, and saving the result web pages;

S125. after denoising the result web pages obtained in step S124, classifying each sentence in the web pages with the classifier generated in step S123;

S126. computing word frequency statistics over the sentences classified as relevant in step S125;

S127. taking the several words with the highest frequency as the seed set, and repeating steps S111 to S115.

In the above method, step S2 specifically comprises:

S21. generating a global attribute value list;

S22. extracting the context in which each attribute value in the list occurs;

S23. extracting the feature set of each attribute value according to a preset feature template;

The feature set is as follows:

Words and parts of speech are used as features here, which reduces the dimensionality of the feature space.

Here i ranges from 0 to length-1, where length is the length of the sentence; word_{i} denotes each word in the sentence and pos_{i} denotes the part of speech of each word in the sentence.

S24. constructing feature vectors, and using vector distance and string similarity as the criteria for measuring the similarity of two concepts;

S25. clustering with the nearest neighbor algorithm.

In the above method, step S3 specifically comprises:

using the TF/IDF (term frequency/inverse document frequency) algorithm to filter out common words.

In the above method, steps S2 and S3 may be executed in either order.

The technical scheme of the invention also proposes a system for automatically generating an ontology hierarchical structure, comprising:

an automatic attribute value extraction module, which extracts the attribute value list of each concept from the Internet;

an attribute value clustering module, which merges similar attribute values in the attribute value list;

an attribute value filtering module, which filters the attribute values in the attribute value list according to the domain characteristics of the concept;

a concept hierarchy generation module, which automatically generates the concept hierarchy of the ontology from the merged and filtered attribute values.

The technical scheme of the invention uses weighted attribute values as the feature vectors of concepts and clusters the concepts with a clustering algorithm. This can significantly improve the accuracy of the result, making it possible to build large-scale, practical ontologies automatically.

Description of drawings

Fig. 1 is a diagram of an embodiment of the system for automatically generating an ontology hierarchical structure according to the invention;

Fig. 2 is an overall framework diagram of the automatic attribute value extraction module in the embodiment of Fig. 1;

Fig. 3 is a schematic diagram of the weakly supervised method of the automatic attribute value extraction module of Fig. 2;

Fig. 4 is a schematic diagram of the unsupervised method of the automatic attribute value extraction module of Fig. 2;

Fig. 5 is a diagram of the tree structure obtained by the embodiment of Fig. 1 in the medical domain.

Embodiment

The following embodiments illustrate the invention but do not limit its scope.

Fig. 1 shows an embodiment of the system for automatically generating an ontology hierarchical structure according to the invention. As shown, the system of this embodiment comprises an automatic attribute value extraction module 101, an attribute value clustering module 102, an attribute value filtering module 103 and a concept hierarchy generation module 104, each of which is described below.

1) Automatic attribute value extraction module 101

Function of the module: automatically extracting the attribute value list of each term from the World Wide Web (WWW).

In the WWW-based method for automatically extending a large-scale ontology, the input is the concept names, attribute names and attribute value seed sets of the ontology; relevant-sentence extraction and attribute value extraction then yield the candidate attribute value set. The overall framework is shown in Fig. 2. As Fig. 2 shows, the input of the system is the concept names and seed sets of the ontology. The query construction module produces queries composed of concept name + attribute name and the seeds; the Internet retrieval module then obtains the set of relevant web pages; the sentence classification module then obtains the set of relevant sentences; finally the attribute extraction module extracts the attribute value set from the relevant sentences, which completes automatic attribute value extraction from the Internet.

The Google API is used as the tool for obtaining the original web pages. The 100 top-ranked pages by relevance in the search results are chosen for information extraction. Before a page is used it must be denoised: first the DOM tree of the page is constructed, then the proportion of links under each leaf Table tag and each leaf Div tag is computed; if the proportion of links exceeds 50%, the block is regarded as noise and removed.
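The denoising rule can be illustrated with a minimal sketch. It assumes the page has already been parsed into a simple tree; the `Node` class, `BLOCK_TAGS` and `denoise` are hypothetical names, and "proportion of links" is interpreted here as the share of a block's text characters that sit inside `a` elements, which the text does not spell out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag, its own text, and children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

BLOCK_TAGS = {"div", "table"}

def text_length(node):
    """Total number of text characters in the subtree."""
    return len(node.text) + sum(text_length(c) for c in node.children)

def link_text_length(node):
    """Number of text characters that sit inside <a> elements."""
    if node.tag == "a":
        return text_length(node)
    return sum(link_text_length(c) for c in node.children)

def is_leaf_block(node):
    """True for a div/table that contains no nested div/table."""
    if node.tag not in BLOCK_TAGS:
        return False
    stack = list(node.children)
    while stack:
        current = stack.pop()
        if current.tag in BLOCK_TAGS:
            return False
        stack.extend(current.children)
    return True

def denoise(node, threshold=0.5):
    """Drop leaf blocks whose link share exceeds the threshold (50% in the text)."""
    kept = []
    for child in node.children:
        if is_leaf_block(child):
            total = text_length(child)
            if total > 0 and link_text_length(child) / total > threshold:
                continue  # link-heavy block, treated as noise
        denoise(child, threshold)
        kept.append(child)
    node.children = kept
    return node

page = Node("body", children=[
    Node("div", children=[Node("a", "Home"), Node("a", "About")]),
    Node("div", "Cough and fever are common symptoms.", [Node("a", "more")]),
])
denoise(page)
```

In this example the navigation block consists only of link text and is removed, while the content block, which has one short link, is kept.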

The attribute values of a specified concept are extracted from the web pages in two steps: a. finding relevant sentences in the web pages; b. finding attribute values in the relevant sentences.

For step a, a relevant sentence can be defined as follows:

the sentence contains a parallel structure (condition 1);

a seed attribute value appears in the parallel structure (condition 2).

So that enough relevant sentences can be obtained from as few given attribute values as possible, and an accurate and complete attribute value set is finally obtained, this embodiment also proposes a method in which sentence selection and attribute value extraction interact. The algorithm is as follows: new attribute values found in sentences are added to the seed set, so that more sentences can be found and more attribute values can be extracted from them. Candidate attribute values are evaluated at the same time, and only candidates whose confidence is higher than a certain threshold are adopted. The whole process ends when no new candidate attribute values are produced. As shown in Fig. 3, it comprises the following steps:

S301. searching the Internet using "class name + attribute + seed set" as the keywords, and saving the relevant web pages;

S302. denoising the result web pages;

S303. selecting sentences according to conditions 1 and 2 above;

S304. extracting the phrases that are in a parallel structure with the seeds, and calculating their weights with the weight formula of this embodiment;

S305. if a new attribute value appears, returning to step S303; otherwise stopping.
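The loop formed by steps S303 to S305 can be sketched as follows. This is only an illustration under simplifying assumptions: retrieval and denoising are replaced by an in-memory list of parallel structures (each a list of phrases), and `bootstrap`, `score` and `threshold` are hypothetical names standing in for the weight calculation and the preset threshold.

```python
def bootstrap(parallel_structures, seeds, score, threshold):
    """Grow the seed set until no new attribute value is accepted.

    parallel_structures: list of phrase lists, one per candidate sentence.
    score(phrase, accepted): hypothetical weight function for a candidate.
    """
    accepted = set(seeds)
    changed = True
    while changed:                       # S305: repeat while new values appear
        changed = False
        for phrases in parallel_structures:
            # Condition 2: at least one known seed occurs in the structure.
            if not accepted.intersection(phrases):
                continue
            for phrase in phrases:       # S304: weigh the coordinated phrases
                if phrase not in accepted and score(phrase, accepted) > threshold:
                    accepted.add(phrase)
                    changed = True
    return accepted

structures = [
    ["cough", "fever"],
    ["fever", "headache"],
    ["login", "register"],
]
# A constant score keeps the example short; a real score would use the weights.
result = bootstrap(structures, {"cough"}, lambda phrase, accepted: 1.0, 0.5)
```

Starting from the single seed "cough", the sketch reaches "fever" and then "headache", while the unrelated structure is never selected because it contains no seed.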

To evaluate candidate attribute values, χ² can be used to calculate the weight of each phrase in the parallel structure; a phrase is added to the seed set if its weight is higher than a preset threshold. However, a phrase coordinated with a high-confidence seed should itself have high confidence, and a phrase coordinated with a low-confidence seed should have low confidence. Therefore, when a phrase and a seed attribute value appear in the same parallel structure, the weight of the seed attribute value should also be added. The calculation is as follows:

χ² is used as the initial weight:

Wherein,

m_{i,j} = \frac{\sum_{i} {freq}_{i,j} \cdot \sum_{j} {freq}_{i,j}}{\sum_{i,j} {freq}_{i,j}}

Iterative formula:

{weight}_{phrase} = {weight}_{phrase} + Σ_{0}^{m} {weight}_{{phrase}_{m}}

Here phrase is the target phrase and {weight}_{phrase} is its weight; {phrase}_{m} is a phrase that appears in a parallel structure together with phrase, and {weight}_{{phrase}_{m}} is its weight. {phrase}_{m} is a seed phrase that co-occurs with the target phrase. In other words, the weight of the target phrase is its initial weight plus the sum of the weights of all phrases coordinated with it.
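As a rough illustration of the two formulas, the sketch below computes a χ² score from a frequency table and then adds the weights of co-occurring seed phrases. The initial-weight formula itself does not survive in this text, so the standard Pearson statistic is assumed, with m_{i,j} as the expected frequency defined above; how the frequency table is built is also not stated, and the function names are hypothetical.

```python
def chi_square(freq):
    """Pearson chi-square over a frequency table (list of rows).

    Assumption: expected frequency m[i][j] = row_total[i] * col_total[j] / total,
    matching the definition of m_{i,j} in the text.
    """
    row_totals = [sum(row) for row in freq]
    col_totals = [sum(col) for col in zip(*freq)]
    total = sum(row_totals)
    score = 0.0
    for i, row in enumerate(freq):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

def propagated_weight(initial_weight, seed_weights):
    """Iterative formula: initial weight plus the weights of co-occurring seeds."""
    return initial_weight + sum(seed_weights)

strongly_associated = chi_square([[10, 0], [0, 10]])
independent = chi_square([[5, 5], [5, 5]])
final = propagated_weight(strongly_associated, [1.0, 2.0])
```

A table with no association scores zero, so such a phrase would rely entirely on the weights of the seeds it is coordinated with.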

The method proposed above (hereinafter the weakly supervised method) fills an attribute of a target class when a seed set of attribute values exists for that attribute. Its difficulty is that, when a large-scale ontology is to be built automatically, a seed set cannot be specified for every attribute of every class. To solve this problem, an unsupervised attribute value extraction method is further proposed below, so that in filling the whole ontology a seed set need only be specified manually for a given attribute of one class, and the values of that attribute for all other classes can then be filled automatically.

While the weakly supervised method fills a class, a set of sentences is obtained. If patterns can be found in these sentences, the patterns can be used to judge whether a sentence describes a given kind of attribute, replacing conditions 1 and 2 of the weakly supervised method. To improve the correctness of the attribute value set, only high-confidence candidate attribute values are selected and used as the seed set, after which automatic attribute value extraction can be completed by the weakly supervised method. The unsupervised attribute value extraction method, shown in Fig. 4, comprises the following steps:

S401. reading the relevant sentences produced while the weakly supervised method fills the attribute, with the parallel structure in each sentence replaced by an empty slot;

S402. extracting the features of the sentences from S401 according to the feature template;

S403. training a classifier with a maximum entropy tool on the sentence features generated in S402;

S404. searching the Internet using "class name + attribute" as the keywords, and saving the relevant web pages;

S405. after denoising the result web pages, classifying each sentence in the web pages with the classifier generated in S403;

S406. computing word frequency statistics over the sentences classified as relevant in S405;

S407. taking the 3 words with the highest frequency as the seed set, and repeating the process of the weakly supervised method.
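Steps S406 and S407 amount to a frequency count over the relevant sentences. The sketch below assumes the sentences are already tokenized and classified; `top_seed_candidates` and the `exclude` parameter (for example, to skip the query words themselves) are hypothetical additions for illustration.

```python
from collections import Counter

def top_seed_candidates(relevant_sentences, k=3, exclude=frozenset()):
    """Return the k most frequent words in the relevant sentences.

    relevant_sentences: list of token lists (already classified as relevant).
    exclude: words to ignore, e.g. the query terms (an assumption of this sketch).
    """
    counts = Counter(
        word
        for sentence in relevant_sentences
        for word in sentence
        if word not in exclude
    )
    return [word for word, _ in counts.most_common(k)]

sentences = [
    ["symptom", "cough", "fever"],
    ["symptom", "cough"],
    ["symptom", "rash"],
]
seeds = top_seed_candidates(sentences, k=2)
```

The returned words would then be passed to the weakly supervised loop as its seed set.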

The process of the weakly supervised method yields a set of relevant sentences, denoted TrainingSentenceSet. The parallel structure in each sentence of TrainingSentenceSet is replaced with a variable; for example, "... the symptoms are: cough, fever etc." is replaced with "... the symptoms are: $ etc.". What remains is the part of the sentence that is independent of the specific case and reflects the sentence pattern of the current attribute relation. These sentences are then used as training examples to train a classifier. The feature templates used in this embodiment are shown in Table 1.

Table 1. Feature templates of the sentence pattern classifier

Here i ranges from 0 to length-1, where length is the length of the sentence; word_{i} denotes each word in the sentence; word_{i-1} + word_{i} denotes the bigram formed by the previous word and the current word; word_{i-1} + word_{i} + word_{i+1} denotes the trigram formed by the previous word, the current word and the next word; pos_{i} denotes the part of speech of each word in the sentence; combinations of parts of speech are read in the same way as combinations of words.
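The masking step and the word and part-of-speech templates described above can be sketched as follows. Table 1 itself is not reproduced in this text, so the template set and the feature naming below are assumptions; the functions `mask_parallel_structure` and `extract_features` are hypothetical, and tokenization and tagging are taken as given.

```python
def mask_parallel_structure(sentence, phrases, placeholder="$"):
    """Replace the span covering the coordinated phrases with a placeholder."""
    spans = [(sentence.find(p), p) for p in phrases if p in sentence]
    if not spans:
        return sentence
    start = min(pos for pos, _ in spans)
    end = max(pos + len(p) for pos, p in spans)
    return sentence[:start] + placeholder + sentence[end:]

def extract_features(words, tags):
    """Unigram, bigram and trigram features over words and parts of speech."""
    features = []
    n = len(words)
    for i in range(n):
        features.append(f"word={words[i]}")
        features.append(f"pos={tags[i]}")
        if i >= 1:
            features.append(f"word_bigram={words[i-1]}+{words[i]}")
            features.append(f"pos_bigram={tags[i-1]}+{tags[i]}")
        if 1 <= i < n - 1:
            features.append(f"word_trigram={words[i-1]}+{words[i]}+{words[i+1]}")
            features.append(f"pos_trigram={tags[i-1]}+{tags[i]}+{tags[i+1]}")
    return features

masked = mask_parallel_structure(
    "The symptoms are: cough, fever etc.", ["cough", "fever"]
)
features = extract_features(["symptoms", "are", "$"], ["NN", "VB", "SYM"])
```

The resulting feature strings could be fed to any maximum entropy (multinomial logistic regression) trainer; the choice of tool is left open here, as it is in the text.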

Because the sentence patterns that describe an attribute are limited, choosing too many templates would make the data sparse and harm the result. After a series of experiments, the feature templates finally selected in this embodiment are templates 1, 2, 3 and 4 of Table 1. A classifier is trained with the maximum entropy algorithm. The candidate attribute values are still evaluated with the algorithm of the weakly supervised method, which yields the seed set, and the attribute filling process is then carried out with the weakly supervised method.

In the above embodiment, the unsupervised method is a further refinement of the weakly supervised method and must be built on it. In use, a seed set of attribute values is first given for one concept; its attribute values are extracted with the weakly supervised method, and the set of sentences containing these attribute values is extracted at the same time. When the attribute values of other concepts are to be extracted, a seed set cannot be given for every concept, so the sentence set obtained by the weakly supervised method is used as the training set and the unsupervised method is used to extract the attribute value sets of the other concepts.

2) Attribute value clustering module 102

The function of this module is to merge similar attribute values, which reduces dimensionality and improves accuracy. The algorithm can be expressed in pseudocode as follows:

i. merging the attribute values of all concepts into one table to generate the global attribute value list, which is a lookup table of "attribute value, weight" pairs. It is constructed as follows: for each attribute value of each concept, if the attribute value does not yet appear in the global list, an entry is added with its weight set to the weight of this attribute value; otherwise the existing weight is increased by the weight of this attribute value;

ii. extracting the context in which each attribute value in the list occurs, where the context is defined as the sentence in which the attribute value appears;

iii. extracting the feature set of each attribute value according to a feature template (words and parts of speech, as in step S23);

iv. constructing feature vectors, and using vector distance and string similarity as the criteria for measuring the similarity of two concepts;

v. clustering with the nearest neighbor algorithm; clustering algorithms are mature and are not described further here.
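Steps i and iv can be sketched as follows. The text does not say how vector distance and string similarity are combined, so the equal weighting (`alpha=0.5`), the use of cosine similarity, and the use of `difflib.SequenceMatcher` for string similarity are all assumptions of this sketch, as are the function names.

```python
import math
from difflib import SequenceMatcher

def build_global_list(concept_attributes):
    """Step i: merge per-concept {attribute value: weight} tables into one table."""
    global_weights = {}
    for attributes in concept_attributes.values():
        for value, weight in attributes.items():
            # New values start at their weight; repeated values accumulate.
            global_weights[value] = global_weights.get(value, 0.0) + weight
    return global_weights

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

def attribute_similarity(a, b, features, alpha=0.5):
    """Step iv: blend context-vector similarity with string similarity."""
    string_sim = SequenceMatcher(None, a, b).ratio()
    return alpha * cosine(features[a], features[b]) + (1 - alpha) * string_sim

merged = build_global_list({
    "flu": {"cough": 0.5, "fever": 0.5},
    "cold": {"cough": 0.25},
})
context_features = {
    "cough": {"word=symptom": 1.0},
    "coughing": {"word=symptom": 1.0},
    "rash": {"word=skin": 1.0},
}
near = attribute_similarity("cough", "coughing", context_features)
far = attribute_similarity("cough", "rash", context_features)
```

Values whose mutual similarity is high, such as "cough" and "coughing" here, are the ones a nearest neighbor clustering step would merge.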

3) Attribute value filtering module 103

Because the ontology to be processed is domain-specific, attribute values that belong to the terminology of the domain better reflect the features of a concept. Whether an attribute value is domain-specific is judged mainly by the following means:

TF/IDF (Term Frequency/Inverse Document Frequency) is used to filter out common words.

The weights of attribute values relevant to the domain are increased according to the above means.
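A minimal TF/IDF sketch is given below. The text does not define what counts as a document, so treating each token list (for example, the attribute values collected for one concept) as a document is an assumption, and the unsmoothed logarithmic IDF is one common variant rather than the one necessarily used in the embodiment.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf score} dict per document (list of tokens)."""
    n_docs = len(documents)
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(doc))
    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / document_frequency[term])
            for term, count in term_counts.items()
        })
    return scores

docs = [
    ["cough", "fever", "treatment"],
    ["rash", "treatment"],
]
weights = tf_idf(docs)
```

A word that occurs in every document, such as "treatment" here, receives a score of zero and can be filtered out as a common word.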

4) Concept hierarchy generation module 104

The similarity of two concepts must first be defined. In this embodiment, the similarity of two concepts is defined on the basis of the following two hypotheses:

a. the more similar two concepts are on their key attributes, the more similar the two concepts are;

b. the more similar attributes two concepts have, the more similar the two concepts are.

For hypothesis a, it is necessary to define which attributes are the key attributes of a concept. Automatic attribute value extraction and attribute value selection yield a weight for each attribute value of a concept; the weight reflects how closely the attribute value is tied to the concept, and attribute values with higher weights are more key.

For hypothesis b, the weights are uniformly amplified, so that if two concepts share an attribute value, it is not ignored even when its weight is very small.

To take both hypotheses into account, the formula is as follows:

Distance({Concept}_{1}, {Concept}_{2}) = \sum \sum \sqrt{{weight}_{i}} \sqrt{{weight}_{j}}

Because the attribute value weights have already been normalized, taking the square root achieves the amplification effect.
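The formula can be illustrated with a short sketch. The summation indices are not given in the text, so it is assumed here that the sum runs over attribute values shared by both concepts, pairing the weight in the first concept with the weight in the second; `concept_similarity` is a hypothetical name. Note that, despite the name Distance in the formula, a larger value means more similar concepts.

```python
import math

def concept_similarity(weights_a, weights_b):
    """Sum of sqrt(weight in A) * sqrt(weight in B) over shared attribute values.

    weights_a, weights_b: {attribute value: normalized weight in [0, 1]}.
    """
    shared = weights_a.keys() & weights_b.keys()
    return sum(math.sqrt(weights_a[v]) * math.sqrt(weights_b[v]) for v in shared)

flu = {"cough": 0.25, "fever": 0.75}
cold = {"cough": 0.25, "sneezing": 0.75}
eczema = {"rash": 1.0}

flu_cold = concept_similarity(flu, cold)
flu_eczema = concept_similarity(flu, eczema)
```

Because the weights lie between 0 and 1, the square root raises small weights (0.25 becomes 0.5), which is how a shared but low-weight attribute value still contributes noticeably.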

Next, the KNN algorithm is used to cluster the concepts. For a concept A, the N concepts most similar to it are found (the similarity of two concepts is computed as above), and these N+1 concepts are merged into one cluster whose core is the merge of the vectors in the cluster. The whole cluster is then treated as a single concept, with the core representing the concepts of the cluster in the next round of clustering. The algorithm can be expressed in pseudocode as follows:

i. taking a concept A from the concept list;

ii. finding the N concepts closest to A;

iii. adding these N+1 concepts to a set S;

iv. setting the feature vector of S to the sum of the feature vectors of the elements of S;

v. taking the merged result as input and repeating i to iv until S contains all elements of the concept list.
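Steps i to v can be sketched as the loop below. It is only one reading of the pseudocode: the next concept A is simply the first item of the working list, the tree is represented as nested Python lists, and the similarity function repeats the shared-attribute assumption made for the formula above. The names `build_hierarchy` and `n_neighbours` are hypothetical.

```python
import math

def similarity(a, b):
    """Similarity over shared attribute values (assumed reading of the formula)."""
    return sum(math.sqrt(a[k]) * math.sqrt(b[k]) for k in a.keys() & b.keys())

def build_hierarchy(concepts, n_neighbours):
    """Repeatedly merge a concept with its N most similar neighbours.

    concepts: {name: {attribute value: weight}}. Returns a nested list tree.
    """
    if not concepts or n_neighbours < 1:
        raise ValueError("need at least one concept and n_neighbours >= 1")
    items = [(name, dict(vector)) for name, vector in concepts.items()]
    while len(items) > 1:
        seed_tree, seed_vector = items.pop(0)                    # step i
        items.sort(key=lambda item: similarity(seed_vector, item[1]),
                   reverse=True)                                  # step ii
        neighbours, items = items[:n_neighbours], items[n_neighbours:]
        merged_vector = dict(seed_vector)                         # step iv
        for _, vector in neighbours:
            for key, weight in vector.items():
                merged_vector[key] = merged_vector.get(key, 0.0) + weight
        merged_tree = [seed_tree] + [tree for tree, _ in neighbours]  # step iii
        items.append((merged_tree, merged_vector))                # step v
    return items[0][0]

tree = build_hierarchy(
    {
        "flu": {"cough": 0.5, "fever": 0.5},
        "cold": {"cough": 0.5, "sneezing": 0.5},
        "eczema": {"rash": 1.0},
    },
    n_neighbours=1,
)
```

In this example "flu" and "cold" share an attribute value and are merged first; the resulting cluster is then joined with "eczema" at the root.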

The final result is a tree that contains all concepts, with concepts as nodes connected by hypernym-hyponym relations. Fig. 5 shows an example from the medical domain.

The above is the preferred embodiment of the invention. Based on the disclosure, those of ordinary skill in the art can readily conceive of equivalent or alternative schemes, all of which fall within the scope of protection of the invention.

Claims

1. A method for automatically generating an ontology hierarchical structure, characterized in that the method comprises the following steps:

S1. extracting an attribute value list of each concept from the Internet;

S2. merging similar attribute values in the attribute value list;

2. The method for automatically generating an ontology hierarchical structure according to claim 1, characterized in that step S1 specifically comprises:

S112. denoising the result web pages;

3. The method for automatically generating an ontology hierarchical structure according to claim 2, characterized in that the preset conditions of step S113 comprise:

the sentence contains a parallel structure;

a seed attribute value appears in the parallel structure.

4. The method for automatically generating an ontology hierarchical structure according to claim 3, characterized in that step S1 specifically comprises:

5. The method for automatically generating an ontology hierarchical structure according to claim 1, characterized in that step S2 specifically comprises:

S21. generating a global attribute value list;

S25. clustering with the nearest neighbor algorithm.

6. The method for automatically generating an ontology hierarchical structure according to claim 1, characterized in that step S3 specifically comprises:

using the TF/IDF (term frequency/inverse document frequency) algorithm to filter out common words.

7. The method for automatically generating an ontology hierarchical structure according to claim 5 or 6, characterized in that steps S2 and S3 may be executed in either order.

8. A system for automatically generating an ontology hierarchical structure, characterized in that the system comprises: