CN103324700A

CN103324700A - Ontology concept attribute learning method based on Web information

Info

Publication number: CN103324700A
Application number: CN2013102292298A
Authority: CN
Inventors: 王俊丽; 王志成; 赵卫东; 梁梅连
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2013-06-08
Filing date: 2013-06-08
Publication date: 2013-09-25
Anticipated expiration: 2033-06-08
Also published as: CN103324700B

Abstract

The invention relates to the field of ontology learning, and in particular to an ontology concept attribute learning method based on Web information. In this technical scheme, the Web serves as the corpus. A set of language patterns is built and used as the query set for the Google search engine, and the returned web page snippets and their source URLs are extracted to build a candidate concept attribute lexicon. A text set built from the URLs of the candidate words serves as the LDA input, and the training parameters of the LDA model are obtained by Gibbs sampling. The candidate attribute lexicon is then pruned and merged according to the output of the LDA model to determine the final concept attribute word set. The method can obtain the concept attribute set of an ontology accurately and effectively, making automatic or semi-automatic ontology construction possible.

Description

An ontology concept attribute learning method based on Web information

Technical field

The present invention relates to ontology learning technology and the field of Internet technology, and in particular to an ontology concept attribute learning method based on Web information.

Background technology

The Semantic Web has long been a hot field in computer science research. Its research focuses mainly on how to represent information on the Web in a form that machines can understand and process, that is, a form that carries semantics. An ontology, as a modeling tool that can describe conceptual models at the semantic and knowledge levels, is the core of semantic description in the Semantic Web. At present, ontologies are widely used as an important resource providing domain knowledge support in intelligent information processing tasks such as knowledge engineering, information retrieval, and question answering systems.

Ontology learning obtains the desired ontology knowledge automatically or semi-automatically from existing data resources by means of techniques such as machine learning, statistical methods, and natural language processing. Because fully automatic knowledge acquisition is still unrealistic, ontology learning is usually a semi-automatic process carried out under user guidance.

In ontology concept knowledge construction, describing a conceptual model requires not only the concept noun but also a description of the attributes of the objective entity that the concept reflects; these attributes are called concept attributes. Ontology attributes are an important component of the construction and application of a domain ontology knowledge base, and extracting them is a basic research task in the automatic or semi-automatic construction of such knowledge bases. Related research at home and abroad currently concentrates on extracting ontology concept instances and attributes, or on extracting pairs of concept attributes and attribute values, and has made some progress. Research methods for ontology concept attribute extraction fall mainly into three classes:

Rule-based methods: a set of pattern rules based on words, parts of speech, and semantics is first constructed and stored. During attribute extraction, the sentence fragments to be processed are analyzed using linguistic knowledge and matched against the stored patterns; if the match succeeds, the sentence is taken to express the relation of the corresponding pattern. Rule-based methods require domain experts to take part in formulating the pattern rules, so they are costly and lack portability across domains;

Statistics-based machine learning methods: these are widely used in concept attribute extraction at present. A machine learning algorithm first trains a classifier model on a manually annotated corpus; the resulting classifier is then applied to unannotated text to identify the predefined categories. This approach is currently in wide use and has achieved considerable results.

Methods based on semi-structured/structured documents: extracting concept attributes by analyzing the structure of semi-structured or structured documents is also one of the main methods of concept attribute extraction today. Its weakness is that it suits only documents whose format is relatively fixed and complete, and it lacks generalization ability.

Summary of the invention

The purpose of the present invention is to provide an ontology concept attribute learning method based on Web information that combines techniques based on linguistic patterns with techniques based on probability statistics to learn ontology concept attributes, and applies the LDA model to the concept attribute selection stage, so as to generate ontology concept attributes more accurately and effectively.

To achieve the above purpose, the present invention proposes a hybrid method, combining rules and machine learning and independent of document structure, for learning ontology concept attributes. Lexical-syntactic patterns are used to build a pattern set; candidate concept attribute words are extracted with the Web as the corpus; a text set is built from the extraction results as the input of the LDA model; Gibbs sampling is used to obtain the training parameters of the LDA model; and after the LDA model is run, the candidate concept attribute lexicon of the ontology is pruned and merged according to its extraction results to obtain the final concept attribute set.

The present invention provides the following technical solution:

An ontology concept attribute learning method based on Web information, characterized in that it comprises the following steps:

(1) Construction of the lexical-syntactic pattern set. Starting from the existing basic language pattern set, lexical-semantic patterns are used to build and merge an extended pattern set of verb forms that express the inclusion relation. The resulting pattern set for expressing concept attributes forms part of the input of the candidate concept attribute extraction algorithm.

(2) Construction of the candidate concept attribute lexicon. With the Google search engine as the Web data source (corpus), the language pattern set is first built and used as the query input to Google, and the corresponding set of web page query snippets and the set of source URLs are extracted. Candidate attribute words are then obtained from the snippets by word frequency statistics (the higher a word's frequency, the more likely it is an attribute word), and the candidate concept attribute word set is obtained after simple screening.

(3) Construction of the text set. For the attribute words in the candidate lexicon, the corresponding source URLs are retained and the web pages are extracted. The extracted set of web documents is given basic preprocessing with Apache's open-source tool OpenNLP, mainly part-of-speech tagging.

(4) Pruning and merging the concept attribute set with LDA. The LDA model is run on the input text set, combined with the results of Gibbs sampling parameter estimation. The candidate concept attribute lexicon is pruned and merged according to the extraction results of multiple iterations of the LDA model to obtain the final concept attribute set.

In the above ontology concept attribute learning method, step (2) specifically comprises:

1) for each pattern p_i in the pattern set P, run the query p_i in Google;

2) for each page n of the N result pages returned for query p_i, if the query result is enclosed in <em></em> tags, extract the corresponding web page snippet S_i and the corresponding source URL U_i, until every pattern in the pattern set P has been queried;

3) for each snippet S_i in the snippet set, compute the word frequency statistics C_Wi and reject the non-nouns W_n.
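The snippet filtering and word counting of sub-steps 2) and 3) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `(snippet_html, url)` input format, the function name `collect_candidates`, and the toy noun test are all assumptions, and no real search engine is queried.

```python
import re
from collections import Counter

EM_SPAN = re.compile(r"<em>.*?</em>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"[a-z]+")

def collect_candidates(results, is_noun):
    """results: iterable of (snippet_html, url) pairs (assumed input format).

    Keeps only snippets containing an <em>...</em> hit, then counts the
    words accepted by is_noun. Returns (word_counts, source_urls).
    """
    counts = Counter()
    urls = []
    for snippet_html, url in results:
        if not EM_SPAN.search(snippet_html):
            continue  # no highlighted match in this result, skip it
        urls.append(url)
        text = TAG.sub(" ", snippet_html).lower()
        counts.update(w for w in WORD.findall(text) if is_noun(w))
    return counts, urls

# Hypothetical data; a real run would take its results from a search engine.
sample = [
    ("The <em>engine of the car</em> is a V6 engine", "http://example.com/a"),
    ("No highlighted match here", "http://example.com/b"),
]
toy_nouns = {"engine", "car"}  # stand-in for a real part-of-speech check
counts, urls = collect_candidates(sample, lambda w: w in toy_nouns)
print(counts.most_common(), urls)
```

Passing the noun test in as a function keeps the counting logic independent of whichever part-of-speech tagger is used.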

In the above ontology concept attribute learning method, step (3) specifically comprises:

1) for each U_i in the URL set, extract the corresponding web page content and save it as document d_i;

2) preprocess each document d_i in the document set D with OpenNLP;

3) if the part of speech of word w_i is NN/NNS/NNP/NNPS, extract w_i, until the whole document set D has been processed.
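The noun filter of sub-step 3) can be sketched as below, assuming the documents have already been tagged (for example by OpenNLP) into `(word, tag)` pairs with Penn Treebank tags. The tagging itself is not shown, and the function name `keep_nouns` is hypothetical.

```python
# Penn Treebank noun tags named in sub-step 3)
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_nouns(tagged_doc):
    """tagged_doc: list of (word, tag) pairs; returns the words tagged as nouns."""
    return [word for word, tag in tagged_doc if tag in NOUN_TAGS]

# Hypothetical tagger output for one short sentence.
tagged = [("The", "DT"), ("car", "NN"), ("has", "VBZ"),
          ("four", "CD"), ("wheels", "NNS")]
print(keep_nouns(tagged))  # ['car', 'wheels']
```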

In the above ontology concept attribute learning method, step (4) specifically comprises:

1) at the topic level, for each topic z in the topic set T, draw the mixing parameter φ_z ~ Dir(β);

2) at the document level, for each document d in the document set D, draw the mixing parameter θ_d ~ Dir(α), and draw a value from a Poisson distribution as the document length, i.e. the length of each document N_d ~ Poiss(ξ);

3) at the word level, under the conditions of 2), for each word n in the word set N_d of document d, draw a topic z_{d,n} ~ Mult(θ_d), and then draw the term w_{d,n} ~ Mult(φ_{z_{d,n}});

4) repeat steps 1), 2) and 3), which together constitute the random generative process, until all D documents have been traversed.
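The generative process of steps 1) to 4) can be simulated with the Python standard library alone. The sketch below is illustrative only: the symbols φ_z (per-topic word distribution) and θ_d (per-document topic mixture) follow the usual LDA notation and are assumed here because the formulas are not reproduced in this text; the function names and the symmetric priors are likewise assumptions.

```python
import math
import random

def sample_dirichlet(rng, concentration, k):
    """Symmetric Dirichlet draw via normalised Gamma(concentration, 1) variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_poisson(rng, lam):
    """Poisson draw using Knuth's multiplication method (suitable for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def generate_corpus(num_docs, num_topics, vocab, alpha, beta, xi, seed=0):
    rng = random.Random(seed)
    # Step 1: one word distribution per topic, phi_z ~ Dir(beta)
    phi = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Step 2: topic mixture theta_d ~ Dir(alpha), length N_d ~ Poiss(xi)
        theta = sample_dirichlet(rng, alpha, num_topics)
        length = sample_poisson(rng, xi)
        doc = []
        for _ in range(length):
            # Step 3: topic z ~ Mult(theta_d), then word w ~ Mult(phi_z)
            z = rng.choices(range(num_topics), weights=theta)[0]
            doc.append(rng.choices(vocab, weights=phi[z])[0])
        corpus.append(doc)
    return corpus

corpus = generate_corpus(3, 2, ["engine", "wheel", "door"], 0.5, 0.5, 5.0)
print(corpus)
```

In practice the method runs in the opposite direction: the documents are observed and the distributions are inferred, which is the role of Gibbs sampling.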

By using the Web as the corpus, the technical scheme of the present invention addresses the data sparsity problem that often arises in pattern learning; by using the LDA model to prune and merge the candidate concept attribute lexicon, it can markedly improve the accuracy of the extraction results. This makes semi-automatic ontology construction possible and lays a foundation for automated ontology construction.

Description of drawings

Fig. 1 is the model architecture diagram of ontology concept attribute learning according to the present invention;

Fig. 2 is the overall framework diagram of the model architecture of Fig. 1;

Fig. 3 is the structure diagram of the LDA model in Fig. 2;

Fig. 4 shows the attribute extraction results obtained in the car domain with the model architecture of Fig. 1.

Embodiment

As shown in the model architecture diagram of Fig. 1, the ontology concept attribute learning method according to the specific embodiment of the invention comprises the following steps:

1) Lexical-syntactic pattern set construction module

Module function: the language pattern set is a necessary input to the Google query, so the pattern set must be built first.

With existing natural language processing techniques, a set of pattern rules based on words, parts of speech, and semantics (that is, language patterns) is constructed, and relations of interest in the text are identified by pattern matching. This embodiment studies one kind of language pattern, the lexical-syntactic pattern. Starting from the existing basic language pattern set, lexical-semantic patterns are used to build and merge an extended pattern set of verb forms that express the inclusion relation; the resulting pattern set for expressing concept attributes forms part of the input of the candidate concept attribute word extraction algorithm.

The idea behind lexical-syntactic pattern matching can be seen intuitively from the following example: let the target string be cdabfdbab and the pattern string be ab; after pattern matching, the starting positions of the substrings of the target string that equal the pattern string are 3 and 8. This embodiment takes car as the concept topic; its concept attribute detection patterns are shown in Table 1.
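The string example above can be reproduced with a few lines of code. The helper name `find_all` is hypothetical; positions are reported 1-based to match the text.

```python
def find_all(target: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of pattern in target."""
    positions = []
    start = target.find(pattern)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = target.find(pattern, start + 1)
    return positions

print(find_all("cdabfdbab", "ab"))  # [3, 8]
```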

Table 1 Concept attribute detection patterns

Here, the NP in a general pattern can be any concept noun (car in this embodiment), and the words shown in bold in the examples are the attribute candidate words of car.
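Turning general patterns into search queries for a given concept noun might look like the sketch below. The pattern strings are hypothetical stand-ins, since the contents of Table 1 are not reproduced in this text; the `*` wildcard slot, the quoting of each phrase, and the name `build_queries` are also assumptions.

```python
# Hypothetical pattern templates: illustrative only, not the patterns of Table 1.
# "NP" marks the concept noun; "*" marks the slot where an attribute word may appear.
PATTERN_TEMPLATES = [
    "the * of the NP",
    "NP consists of *",
    "NP is made up of *",
]

def build_queries(concept: str, templates: list[str]) -> list[str]:
    """Substitute the concept noun for NP and quote each phrase for exact matching."""
    return ['"' + t.replace("NP", concept) + '"' for t in templates]

for query in build_queries("car", PATTERN_TEMPLATES):
    print(query)
```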

2) Candidate concept attribute lexicon construction module

Module function: extract candidate concept attributes from the Web and build the candidate concept attribute lexicon.

With the Google search engine as the Web data source (corpus), the language pattern set is used as the query input to Google, and the corresponding set of web page query snippets and the set of source URLs are extracted. Candidate attribute words are then obtained from the snippets by word frequency statistics (the higher a word's frequency, the more likely it is an attribute word), and the candidate concept attribute word set is obtained after simple screening.

According to the results of the candidate concept attribute extraction algorithm, some of the web page snippets extracted in this embodiment, the candidate attribute words, and the word frequencies of those attribute words are shown in Table 2.

Table 2 Examples of partial web page extraction results

In this embodiment, after the language patterns have been used to extract candidate attribute words from the Web, words that are not nouns are rejected, because concept attribute words are all nouns; this finally yields a candidate attribute lexicon.

3) Text set construction module

Module function: the candidate attribute lexicon cannot be taken as the final attribute word set; the LDA model must further be used to extract the words related to concept attributes. The text set is an important input of the LDA model.

The Web-based candidate concept attribute extraction described above yields not only the candidate attribute lexicon but also a set of source URLs. For the attribute words in the candidate lexicon, the corresponding source URLs are retained and the web pages are extracted.

The extracted set of web documents is given basic preprocessing, such as part-of-speech tagging, with Apache's open-source tool OpenNLP. The text set composed of the nouns serves as part of the input of the LDA model. Combined with the results of Gibbs sampling parameter estimation, the LDA model can then be used to extract attribute words.

4) Module for pruning and merging the candidate attribute lexicon with the LDA model

Module function: the candidate concept attribute lexicon is pruned and merged using the extraction results of the LDA model, improving the accuracy of the attribute learning results. The algorithm can be expressed in pseudocode as follows:

i. At the topic level, for each topic z in the topic set T, draw a Multinomial distribution from a Dirichlet prior with parameter β, i.e. draw the mixing parameter φ_z ~ Dir(β);

ii. At the document level, for each document d in the document set D, draw a value from a Poisson distribution as the document length, i.e. N_d ~ Poiss(ξ); then draw from a Dirichlet prior with parameter α a Multinomial distribution giving the probability with which words of each topic occur in document d, i.e. draw the mixing parameter θ_d ~ Dir(α);

iii. At the word level, under the conditions of ii, for the n-th word of document d, first draw a topic from the document's Multinomial distribution over topics, z_{d,n} ~ Mult(θ_d), and then draw a word from the Multinomial word distribution of that topic, i.e. for each word n in the word set N_d of document d draw the term w_{d,n} ~ Mult(φ_{z_{d,n}});

iv. Repeat steps i, ii and iii, which together constitute the random generative process, until all D documents have been traversed.

In the above algorithm, w is the observed data; φ, θ and z are the latent variables to be estimated; and α and β are the constant hyperparameters of the model, the Dirichlet priors on θ and φ respectively. The specific variables are listed in Table 3.
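Estimating the latent variables by Gibbs sampling can be sketched as follows. The text does not say which Gibbs variant is used, so this sketch assumes the common collapsed sampler, which resamples each word's topic from counts and recovers φ (topic-word probabilities) and θ (document-topic probabilities) at the end. The function name, the symmetric priors, and the fixed iteration count are assumptions.

```python
import random

def gibbs_lda(docs, num_topics, alpha, beta, iterations=200, seed=0):
    """docs: list of lists of word strings. Returns (phi, theta, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics
    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total word count per topic
    assign = []
    for d, doc in enumerate(docs):      # random initial topic assignment
        zs = []
        for w in doc:
            z = rng.randrange(K)
            zs.append(z)
            n_dk[d][z] += 1
            n_kw[z][index[w]] += 1
            n_k[z] += 1
        assign.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = index[w], assign[d][i]
                # remove the current assignment from the counts
                n_dk[d][z] -= 1
                n_kw[z][v] -= 1
                n_k[z] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][v] + beta) / (n_k[k] + V * beta)
                    for k in range(K)
                ]
                z = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = z
                n_dk[d][z] += 1
                n_kw[z][v] += 1
                n_k[z] += 1
    phi = [[(n_kw[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
           for k in range(K)]
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    return phi, theta, vocab
```

A call such as `gibbs_lda(noun_docs, num_topics=2, alpha=0.5, beta=0.1)` would take the noun-only documents of the text set as `noun_docs` (a hypothetical variable name).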

Table 3 Meaning of the LDA model parameters

Finally, the LDA model is run with car as the example, and the extraction results obtained are shown in Table 4. In this embodiment, the concept attribute word set of this domain was thus extracted with the proposed ontology concept attribute learning method based on Web information.
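The text does not spell out the rule by which the LDA results prune and merge the candidate lexicon. One possible reading, offered purely as an assumption, is to keep a candidate only if it ranks among the top-n words of some topic, and to merge the surviving words across topics without duplicates. The function names and the top-n criterion below are hypothetical.

```python
def top_words(phi, vocab, n):
    """Top-n words of each topic, given topic-word probabilities phi."""
    tops = []
    for row in phi:
        ranked = sorted(range(len(vocab)), key=lambda v: row[v], reverse=True)
        tops.append([vocab[v] for v in ranked[:n]])
    return tops

def prune_and_merge(candidates, phi, vocab, n=10):
    """Assumed rule: keep candidates found in some topic's top-n, merged across topics."""
    final = []
    for topic in top_words(phi, vocab, n):
        for word in topic:
            if word in candidates and word not in final:
                final.append(word)
    return final

# Hypothetical toy values, not results from the patent.
vocab = ["engine", "price", "the"]
phi = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
print(prune_and_merge({"engine", "wheel", "price"}, phi, vocab, n=1))
```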

Claims

1. An ontology concept attribute learning method based on Web information, characterized in that it comprises the following steps:

(1) construction of the lexical-syntactic pattern set:

starting from the existing basic language pattern set, lexical-semantic patterns are used to build and merge an extended pattern set of verb forms that express the inclusion relation, and the resulting pattern set for expressing concept attributes forms part of the input of the candidate concept attribute extraction algorithm;

(2) construction of the candidate concept attribute lexicon:

with the Google search engine as the Web data source, the language pattern set is first built and used as the query input to Google, and the corresponding set of web page query snippets and the set of source URLs are extracted; candidate attribute words are then obtained from the snippets by word frequency statistics, and the candidate concept attribute word set is obtained after screening;

(3) construction of the text set:

for the attribute words in the candidate lexicon, the corresponding source URLs are retained and the web pages are extracted; the extracted set of web documents is given basic preprocessing with Apache's open-source tool OpenNLP, which is used for part-of-speech tagging;

(4) pruning and merging the concept attribute set with LDA:

the LDA model is run on the input text set, combined with the results of Gibbs sampling parameter estimation; the candidate concept attribute lexicon is pruned and merged according to the extraction results of multiple iterations of the LDA model to obtain the final concept attribute set.

2. The ontology concept attribute learning method based on Web information as claimed in claim 1, characterized in that step (2) specifically comprises:

2) for each page n of the N result pages returned for query p_i, if the query result is enclosed in <em></em> tags, extract the corresponding web page snippet S_i and the corresponding source URL U_i, until every pattern in the pattern set P has been queried;

3. The ontology concept attribute learning method based on Web information as claimed in claim 1, characterized in that step (3) specifically comprises:

4. The ontology concept attribute learning method based on Web information as claimed in claim 1, characterized in that step (4) specifically comprises:

and draw a value from a Poisson distribution as the document length, i.e. the length of each document N_d ~ Poiss(ξ);

and draw the term word