CN107463607A

CN107463607A - Method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning

Info

Publication number: CN107463607A
Application number: CN201710484051.XA
Authority: CN
Inventors: 郭剑毅; 马晓军; 余正涛; 陈玮; 张志坤
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-06-23
Filing date: 2017-06-23
Publication date: 2017-12-12
Anticipated expiration: 2037-06-23
Also published as: CN107463607B

Abstract

The present invention relates to a method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning, and belongs to the technical field of natural language processing and machine learning. The method first obtains candidate hypernym-hyponym relation instances from tourism-domain text by bootstrapping learning. It then uses these candidate instances, together with a manually constructed tourism-domain knowledge base and learned mapping matrices, to organize the candidate instances into a hierarchy. The invention extracts hypernym-hyponym relations effectively and provides strong support for tasks such as information extraction, information retrieval and machine translation. Compared with existing recognition methods, its precision, recall and F-value are improved, so the invention has research significance.

Description

Method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning

Technical field

The present invention relates to a method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning, and belongs to the technical field of natural language processing and machine learning.

Background art

The hypernym-hyponym relation is a basic semantic relation that is commonly used in building and verifying ontologies, knowledge bases and dictionaries. From the standpoint of implementation, acquiring hypernym-hyponym relations provides important support for acquiring other information: it allows ontologies, knowledge bases and dictionaries to be checked for correctness, extended and improved. It also yields semantic information about noun phrases, especially out-of-vocabulary words, and further semantic relations between concepts can be obtained by extension. Overall, hypernym-hyponym relation acquisition is a basic and key problem in knowledge acquisition and an important step in converting unstructured information into structured information. It gives foundational support to further information processing such as database querying, data mining and text mining, and it also supports information retrieval, knowledge question answering and personalized information services.

Summary of the invention

The invention provides a method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning, in order to solve the problems that traditional hypernym-hyponym extraction methods depend heavily on the corpus and have relatively low extraction efficiency.

The technical scheme of the invention is a method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning, comprising the following steps:

Step1: first, obtain candidate hypernym-hyponym relation instances from tourism-domain text by bootstrapping learning;

Step1.1: first, write crawler programs manually and crawl tourism-domain text from tourism websites and encyclopedia entries;

Because web pages differ in structure, the positions and tags to be crawled differ as well, and no ready-made program is available, so a separate program has to be written for each crawling task. The corpus is chosen to cover the subject matter of different tourism web pages as comprehensively as possible, for example Baidu Baike entries and tourism information pages.

Step1.2: preprocess the corpus with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal and named entity recognition;

The crawled experimental corpus contains noise such as duplicate pages, web page tags and invalid characters, none of which is useful. This noise is therefore removed by filtering and denoising, so that a high-quality text corpus containing only tourism-domain content is obtained.

The specific steps of Step1.2 are:

Step1.2.1: filter the crawled web page text, removing invalid characters and invalid pages;

Step1.2.2: deduplicate the remaining valid pages and remove junk information;

Step1.2.3: use the Ansj segmentation tool to perform word segmentation, part-of-speech tagging, stop-word removal and named entity recognition on the experimental corpus.

Step1.3: word vectors represent words as dense, low-dimensional real-valued vectors and capture morphological, syntactic and semantic information about words well. Google's open-source toolkit word2vec is therefore selected, and its Skip-gram model is used to train a word vector model on the preprocessed corpus;

Training the word vector model is the premise and basis of hypernym-hyponym extraction and is an indispensable step. Because Chinese text is written as a sequence of characters and, compared with English, the semantic relations between characters are more complex, Chinese text must be segmented into words before it can be represented as word vectors. After segmentation with the tool, the result needs to be proofread manually.
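
As a rough illustration of what Skip-gram training consumes, the sketch below builds (center word, context word) pairs from an already segmented sentence. It is not the word2vec implementation; the function name `skipgram_pairs`, the window size and the sample tokens are assumptions made for this example.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair every token with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sentence, already segmented into words (assumed output of Step1.2).
sentence = ["Kunming", "Stone_Forest", "is", "a", "scenic_spot"]
pairs = skipgram_pairs(sentence, window=1)
```

The pairs are defined over words, which is why segmentation (and its manual proofreading) has to come before vector training.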

Step1.4: scan the preprocessed documents, select the sentences that contain two or more domain entities, and choose the feature contexts;

The specific steps of Step1.4 are:

Step1.4.1: split the text into sentences and annotate the entities manually;

Step1.4.2: scan the processed documents, select the sentences that contain two or more domain entities, and take the words before the first entity (BEF), the words between the two entities (BET) and the words after the second entity (AFT) as the feature contexts.
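
The three feature contexts can be sketched as simple list slices. The patent does not say how many words are taken before and after the entities, so the `window` parameter, the span convention and the function name `feature_contexts` are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def feature_contexts(tokens: List[str], e1: Tuple[int, int], e2: Tuple[int, int],
                     window: int = 2) -> Dict[str, List[str]]:
    """Split a sentence around two entity spans.

    e1 and e2 are (start, end) token indices, end exclusive, with e1 before e2.
    `window` (an assumed parameter) limits how many words BEF and AFT keep.
    """
    return {
        "BEF": tokens[max(0, e1[0] - window):e1[0]],  # words before the first entity
        "BET": tokens[e1[1]:e2[0]],                   # words between the two entities
        "AFT": tokens[e2[1]:e2[1] + window],          # words after the second entity
    }


# Hypothetical annotated sentence: entities are "Stone_Forest" and "scenic_spot".
tokens = ["the", "famous", "Stone_Forest", "is", "a", "scenic_spot", "in", "Yunnan"]
contexts = feature_contexts(tokens, e1=(2, 3), e2=(5, 6))
```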

Step1.5: seed set acquisition. After stop words and adjectives are removed from each context, every remaining word is converted into its word vector, and these vectors are combined in a simple way to give one feature vector per context. Any relation instance is then represented by the combination of its three context vectors;

Acquiring the seed set is likewise a premise and basis of hypernym-hyponym extraction and is an indispensable step. It is the key to the bootstrapping algorithm: only a high-quality seed set allows high-quality hypernym-hyponym extraction patterns to be extracted.
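
A minimal sketch of this representation follows. The source only says the word vectors undergo a "simple combination"; element-wise summation is assumed here, adjectives are assumed to have been removed already using the part-of-speech tags, and all names and the toy embeddings are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Vector = List[float]


def context_vector(words: Sequence[str], embeddings: Dict[str, Vector],
                   stop_words: Set[str], dim: int) -> Vector:
    """Element-wise sum (an assumed 'simple combination') of the kept word vectors."""
    total = [0.0] * dim
    for word in words:
        # Skip stop words and words without a trained vector.
        if word in stop_words or word not in embeddings:
            continue
        total = [a + b for a, b in zip(total, embeddings[word])]
    return total


def instance_vectors(contexts: Dict[str, Sequence[str]], embeddings: Dict[str, Vector],
                     stop_words: Set[str], dim: int) -> Tuple[Vector, ...]:
    """Represent one relation instance as its (BEF, BET, AFT) vectors."""
    return tuple(context_vector(contexts[name], embeddings, stop_words, dim)
                 for name in ("BEF", "BET", "AFT"))


# Toy 2-dimensional embeddings, purely illustrative.
embeddings = {"is": [1.0, 0.0], "kind": [0.0, 1.0], "of": [0.5, 0.5]}
contexts = {"BEF": [], "BET": ["is", "kind", "of"], "AFT": ["unseen_word"]}
instance = instance_vectors(contexts, embeddings, stop_words={"of"}, dim=2)
```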

Step1.6: starting from the hypernym-hyponym seed set obtained in Step1.5, generate hypernym-hyponym extraction patterns by Single-pass clustering. The input of the algorithm is the list of seed relation instances, and the output is the set of relation patterns.

The specific steps of Step1.6 are:

Step1.6.1: assign the first instance to a first, newly created cluster (pattern);

Step1.6.2: traverse the list of seed instances and compute the similarity between each seed instance and every existing cluster. If the similarity exceeds a threshold, add the seed instance to that cluster (pattern); otherwise create a new cluster (pattern).

Step1.6.3: to prevent erroneous patterns from being added to the pattern set, screen the patterns by scoring them.

The main concern of pattern acquisition in the invention is to obtain high-quality hypernym-hyponym extraction patterns.
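
Steps 1.6.1 and 1.6.2 can be sketched as below. The source does not define the similarity measure, so a weighted sum of cosine similarities over the BEF, BET and AFT vectors is assumed, with the weights borrowed from the Conf2 setting reported in the experiments; the threshold value, the use of a cluster centroid and every function name are likewise assumptions. Pattern scoring (Step1.6.3) is not covered.

```python
import math


def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def similarity(inst, other, weights=(0.2, 0.6, 0.2)):
    """Assumed measure: weighted cosine over the (BEF, BET, AFT) vectors."""
    return sum(w * cosine(a, b) for w, a, b in zip(weights, inst, other))


def centroid(cluster):
    """Mean BEF, BET and AFT vectors of the instances in a cluster."""
    n = len(cluster)
    return tuple([sum(col) / n for col in zip(*(inst[part] for inst in cluster))]
                 for part in range(3))


def single_pass(instances, threshold=0.6, weights=(0.2, 0.6, 0.2)):
    clusters = []  # each cluster plays the role of one extraction pattern
    for inst in instances:
        best, best_sim = None, -1.0
        for cluster in clusters:
            s = similarity(inst, centroid(cluster), weights)
            if s > best_sim:
                best, best_sim = cluster, s
        if best is not None and best_sim > threshold:
            best.append(inst)        # similar enough: join the closest pattern
        else:
            clusters.append([inst])  # first instance, or nothing close: new pattern
    return clusters


a = ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
b = ([1.0, 0.0], [0.9, 0.1], [1.0, 0.0])
c = ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
clusters = single_pass([a, b, c])
```

Because the data is visited once and in order, the result depends on the order of the seed list, which is a known property of single-pass clustering.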

Step1.7: after the extraction patterns are obtained in Step1.6, acquire candidate relation instances using the new-relation-instance acquisition method. The input of the algorithm is the set of candidate sentences and the set of relation patterns, and the output is the candidate relation instances.

The specific steps of Step1.7 are:

Step1.7.1: scan the unannotated documents and obtain all text segments whose semantic types match those of the relation instances in the seed set.

Step1.7.2: for each text segment, generate a relation instance in the same way as in Step1.6.

Step1.7.3: if the similarity between a relation instance and some pattern is greater than or equal to the threshold, the relation instance is regarded as a candidate instance.
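
The acceptance test of Step1.7.3 might look as follows. As before, the weighted-cosine similarity, the weights, the threshold value and the names `select_candidates` and `weighted_sim` are assumptions, and each pattern is assumed to be summarised by one (BEF, BET, AFT) vector triple.

```python
import math


def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def weighted_sim(a, b, weights):
    return sum(w * cosine(x, y) for w, x, y in zip(weights, a, b))


def select_candidates(instances, patterns, threshold=0.6, weights=(0.2, 0.6, 0.2)):
    """Keep every instance whose best pattern similarity reaches the threshold.

    instances: list of (entity_pair, (BEF, BET, AFT) vectors)
    patterns:  list of (BEF, BET, AFT) vector triples
    """
    accepted = []
    for pair, vectors in instances:
        score = max((weighted_sim(vectors, p, weights) for p in patterns), default=0.0)
        if score >= threshold:  # "greater than or equal to the threshold"
            accepted.append((pair, score))
    return accepted


patterns = [([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])]
instances = [
    (("Stone_Forest", "scenic_spot"), ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])),
    (("Kunming", "hotel"), ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])),
]
accepted = select_candidates(instances, patterns)
```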

Step2: using the candidate hypernym-hyponym relation instances and a manually constructed tourism-domain knowledge base, organize the candidate instances into a hierarchy by means of mapping matrices;

Step2.1: manually construct a domain knowledge base to serve as training data for the mapping matrices;

The specific steps of Step2.1 are:

Step2.1.1: write crawler programs manually and crawl tourism-domain text from tourism websites and encyclopedia entries;

Step2.1.2: use the open-source toolkit Ansj for word segmentation, part-of-speech tagging and word frequency counting, and take the words that co-occur frequently with the seeds as the domain term set;

Step2.1.3: on the basis of the classification system of Hudong Baike (Interactive Encyclopedia), construct a tourism-domain knowledge base containing 10000 domain entities.
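
The selection of domain terms in Step2.1.2 can be sketched with a co-occurrence counter. Sentence-level co-occurrence, the `min_count` cut-off and the function name `domain_terms` are assumptions; the source only states that words co-occurring frequently with the seeds are kept.

```python
from collections import Counter
from typing import List, Set


def domain_terms(sentences: List[List[str]], seeds: Set[str], min_count: int = 2) -> List[str]:
    """Return words that share a sentence with a seed at least `min_count` times."""
    counts = Counter()
    for tokens in sentences:
        unique = set(tokens)
        if seeds & unique:  # the sentence mentions at least one seed concept
            counts.update(t for t in unique if t not in seeds)
    return [word for word, count in counts.most_common() if count >= min_count]


# Hypothetical segmented sentences and seed concept.
sentences = [
    ["scenic_spot", "Stone_Forest", "ticket"],
    ["scenic_spot", "Stone_Forest", "Kunming"],
    ["weather", "rain"],
]
terms = domain_terms(sentences, seeds={"scenic_spot"})
```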

Step2.2: by clustering the training data and training the corresponding mappings, judge whether two given entities stand in a hypernym-hyponym relation, and organize the relations into a hierarchy.

The specific steps of Step2.2 are:

Step2.2.1: randomly select K cluster centroids from the data set, and cluster the hypernym-hyponym entity pairs (x, y) by their vector offsets y - x using K-means clustering;

Step2.2.2: for each cluster obtained in Step2.2.1, learn a separate mapping Φ_k^* that minimizes the mean squared mapping error over that cluster:

$$\Phi_k^{*} = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y)\in C_k} \lVert \Phi_k x - y \rVert^{2}$$

where Φ_k^* denotes the mapping matrix of the k-th cluster; (x, y) denotes a hypernym-hyponym pair in which x is the hyponym of y and y is the hypernym of x; the term ||Φ_k x - y||² expresses that, for a given entity x and its hypernym y, there is a transition matrix Φ_k such that y = Φ_k x; and N_k is the number of entity pairs in the k-th cluster C_k;
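
One way to fit such a mapping is plain batch gradient descent on the mean squared error ||Φx - y||² averaged over the pairs of one cluster. The source does not name an optimizer, so gradient descent, the identity initialisation, the learning rate, the epoch count and the function names are all assumptions of this sketch.

```python
def matvec(phi, x):
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]


def loss(phi, pairs):
    """Mean of ||phi x - y||^2 over the (x, y) pairs of one cluster."""
    return sum(sum((a - b) ** 2 for a, b in zip(matvec(phi, x), y))
               for x, y in pairs) / len(pairs)


def learn_mapping(pairs, dim, lr=0.1, epochs=500):
    # Assumed initialisation: start from the identity matrix.
    phi = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    n = len(pairs)
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in pairs:
            err = [a - b for a, b in zip(matvec(phi, x), y)]
            for i in range(dim):
                for j in range(dim):
                    grad[i][j] += 2.0 * err[i] * x[j] / n  # d/dphi_ij of the mean loss
        phi = [[phi[i][j] - lr * grad[i][j] for j in range(dim)] for i in range(dim)]
    return phi


# Toy cluster in which the hypernym vector is the hyponym vector with coordinates swapped.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
phi = learn_mapping(pairs, dim=2)
```

With real word vectors of a few hundred dimensions, a vectorised linear-algebra routine would be the practical choice; the nested loops here only make the gradient explicit.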

Step2.2.3: once the mapping matrix of every cluster has been obtained in Step2.2.2, judge whether a new word pair forms a hypernym-hyponym relation;

Step2.2.4: resolve conflicts in the hierarchy with heuristic rules. When a cycle appears in the graph, the weakest edge is removed or reversed; reversing the weakest edge produces an indirect hypernym-hyponym relation. This ensures that the final hierarchy satisfies the constraint of being a directed acyclic graph.
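
The cycle-breaking heuristic can be sketched as repeated cycle detection followed by deletion of the weakest edge on the cycle found. Only the removal variant is shown, not the reversal; the edge weights are assumed to be confidence scores in which a lower value means a weaker edge, and the function names are hypothetical.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    state = {}  # node -> 1 while on the DFS stack, 2 once fully explored
    stack = []

    def dfs(u):
        state[u] = 1
        stack.append(u)
        for v in graph.get(u, []):
            if state.get(v) == 1:  # back edge closes a cycle
                path = stack[stack.index(v):] + [v]
                return list(zip(path, path[1:]))
            if v not in state:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        state[u] = 2
        return None

    for node in list(graph):
        if node not in state:
            found = dfs(node)
            if found:
                return found
    return None


def break_cycles(weights):
    """weights: {(hyponym, hypernym): confidence}. Drop the weakest edge of each cycle."""
    edges = dict(weights)
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            return edges
        weakest = min(cycle, key=lambda e: edges[e])
        del edges[weakest]


weights = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "A"): 0.3, ("C", "D"): 0.7}
dag = break_cycles(weights)
```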

The main task here is to judge whether two given words stand in a hypernym-hyponym relation. After the training data has been clustered and a mapping matrix has been learned for each cluster, it can be judged whether a new word pair forms such a relation. Given two words x and y, first find the cluster C_k closest to their vector offset y - x and obtain the corresponding mapping matrix Φ_k. If y is a hypernym of x, two conditions must be met:

Condition one: the mapping matrix Φ_k makes Φ_k x sufficiently close to y.

Condition two: transitivity is satisfied.
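
Condition one can be sketched as a nearest-centroid lookup followed by a distance test. The Euclidean distance, the threshold name `delta` and its value are assumptions, since the source only requires Φ_k x to be "sufficiently close" to y; the transitivity condition is not implemented here.

```python
import math


def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def matvec(phi, x):
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]


def is_hypernym(x, y, centroids, mappings, delta=0.5):
    """Check condition one for the pair (x, y); `delta` is an assumed threshold."""
    offset = [b - a for a, b in zip(x, y)]
    # Pick the cluster whose centroid is nearest to the offset y - x.
    k = min(range(len(centroids)), key=lambda i: dist(offset, centroids[i]))
    projected = matvec(mappings[k], x)
    return dist(projected, y) < delta


# Toy clusters: centroid of offsets plus the mapping learned for that cluster.
centroids = [[1.0, 0.0], [0.0, 1.0]]
mappings = [
    [[2.0, 0.0], [0.0, 1.0]],  # hypothetical mapping for cluster 0
    [[1.0, 0.0], [0.0, 1.0]],  # hypothetical mapping for cluster 1
]
close_pair = is_hypernym([1.0, 1.0], [2.0, 1.0], centroids, mappings)
far_pair = is_hypernym([1.0, 1.0], [1.0, 3.0], centroids, mappings)
```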

The beneficial effects of the invention are as follows:

1. Compared with existing hypernym-hyponym extraction methods, the method of the invention improves the precision of hypernym-hyponym extraction and achieves better results;

2. Compared with existing hypernym-hyponym extraction methods, the method of the invention represents words as word vectors and obtains the extraction patterns with a bootstrapping algorithm, so domain entity hypernym-hyponym relations are extracted more effectively;

3. The method of the invention extracts hypernym-hyponym relations effectively and provides strong support for downstream work such as information extraction, information retrieval, machine translation and knowledge graph construction.

Brief description of the drawings

Fig. 1 is the overall flow chart of the invention;

Fig. 2 is a partial semantic hierarchy diagram of the domain knowledge base of the invention;

Fig. 3 is an example of the semantic hierarchy constructed for domain entities.

Embodiment

Embodiment 1: as shown in Figs. 1 to 3, a method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning comprises Step1 through Step2.2.4 as set out in the steps above.

As further aspects of the invention, the specific sub-steps of Step1.2, Step1.4, Step1.6, Step1.7 and Step2.1 are likewise those set out above.

In order to learn the mapping matrices, a small-scale domain knowledge base was constructed manually as training data. On the basis of an in-depth analysis of domain attributes and industry attributes, the domain knowledge system was defined manually and a seed set of domain-related concepts was assembled. Online encyclopedia resources were then used to assist in building the small-scale domain knowledge base, giving a tourism-domain knowledge base containing 10000 domain entities. Part of the semantic hierarchy of the tourism-domain knowledge base is shown in Fig. 2;

Once the training data has been clustered and the corresponding mappings have been trained, it can be judged whether two given entities stand in a hypernym-hyponym relation. Fig. 3 shows the hierarchy constructed from domain entity hypernym-hyponym relations using the trained mapping matrices.
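
The clustering of offsets y - x that precedes mapping training can be sketched with a small K-means routine. The iteration count, the fixed random seed, the handling of empty clusters and the toy pairs are assumptions; a tested library implementation would normally be used instead.

```python
import random


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: K initial centroids are sampled at random from the data."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (hyponym vector, hypernym vector) pairs.
pairs = [
    ((1.0, 1.0), (1.0, 1.0)),
    ((2.0, 0.0), (2.0, 1.0)),
    ((0.0, 0.0), (10.0, 10.0)),
    ((1.0, 0.0), (11.0, 11.0)),
]
offsets = [tuple(b - a for a, b in zip(x, y)) for x, y in pairs]
labels, centroids = kmeans(offsets, k=2)
```

Pairs with similar offsets end up in the same cluster, so that each cluster can later receive its own mapping matrix.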

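As a further aspect of the invention, the specific steps of Step2.2 are those set out above, in particular:

Step2.2.2: for each cluster obtained in Step2.2.1, learn a separate mapping Φ_k^* that minimizes the mean squared mapping error over that cluster:

$$\Phi_k^{*} = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y)\in C_k} \lVert \Phi_k x - y \rVert^{2}$$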

This embodiment constructs a tourism-domain knowledge base of 10000 domain entities, which provides the corpus support for learning the mapping matrices of this patent;

To verify the effect of the invention, unified evaluation criteria are used: precision (P), recall (R) and F-value (F) serve as the evaluation criteria for measuring the performance of the invention.
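
For reference, the three measures can be computed as in the sketch below. The source does not spell out the F-value formula; the usual harmonic mean F = 2PR / (P + R) is assumed, and it agrees with the Conf1 figures reported in Table 1. The function names and the toy relation pairs are hypothetical.

```python
def precision_recall(predicted, gold):
    """predicted, gold: sets of (hyponym, hypernym) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f_value(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of the F-value)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = {("a", "b"), ("c", "d")}
gold = {("a", "b"), ("e", "f"), ("g", "h"), ("i", "j")}
p, r = precision_recall(predicted, gold)
f = f_value(p, r)
```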

To verify the effectiveness of the invention, the following groups of experiments were designed:

Experiment 1: to verify how the three kinds of context feature affect the extraction of domain entity hypernym-hyponym relations, two different weight settings were selected, Conf1 and Conf2:

Conf1: α = 0.1, β = 0.8, γ = 0.1

Conf2: α = 0.2, β = 0.6, γ = 0.2

Conf1 relies chiefly on the context of the words between the two entities (BET), whereas Conf2 gives more weight to all three kinds of context feature. The precision, recall and F-value reported here are averages over the TOP5 patterns. The experimental results are shown in Table 1.

Table 1: Effect of different features on the extraction of domain entity hypernym-hyponym relations

Parameter    P (%)    R (%)    F (%)
Conf1        85.8     70.2     77.2
Conf2        79.4     63.5     70.6

The experimental data in Table 1 show that, for most types of hypernym-hyponym pattern, the recall obtained with the Conf2 setting is lower than with the Conf1 setting. Analysis of the results shows that the main reason is that the BEF and AFT context data are too sparse and contain many words that contribute nothing to the relation between the entity pair. The results indicate that the words between the two entities matter more than the other contexts for recognizing the hypernym-hyponym relation of an entity pair.

Experiment 2: to verify the feasibility of the proposed method, a comparison test was carried out on the same experimental data set, using the TOP5 pattern clustering results. The experimental results are shown in Table 2.

Table 2: Comparison of different hypernym-hyponym extraction methods

Table 2 shows that, for the TOP5 pattern clustering results, the proposed method achieves better F-values than the Snowball algorithm throughout, and for some relation patterns its F-value is as much as 20% higher than that of Snowball. The results indicate that the word vector model can represent the semantic features of language and improves the extraction of entity hypernym-hyponym relations.

Experiment 3: to verify the influence of the domain knowledge base on the construction of the hierarchy, the experiment was run both with and without the manually constructed domain knowledge base. The experimental results are shown in Table 3.

Table 3: Influence of the domain knowledge base on the organization of domain entity hypernym-hyponym relations

Method                          P (%)    R (%)    F (%)
Word vectors                    75.3     67.5     69.3
Word vectors + knowledge base   78.3     79.8     79.0

As Table 3 shows, when the domain knowledge base is added as a constraint, the recall of the method rises considerably (from 67.5% to 79.8%), which demonstrates that the domain knowledge base plays an important role in organizing the hypernym-hyponym relations of domain entities.

Experiment 4: to verify the feasibility of the proposed method, it was compared with a rule-based method and a CRF-based method. The experimental results are shown in Table 4.

Table 4: Recognition results for domain entity hypernym-hyponym relations

Method               P (%)    R (%)    F (%)
Rule-based method    84.4     48.9     61.9
CRF-based method     75.1     72.4     73.7
Proposed method      78.2     79.8     79.0

As Table 4 shows, the proposed word-vector-based method has slightly lower precision than the rule-based method but far higher recall. Compared with the method based on cascaded conditional random fields (CRF), the proposed method improves both precision and recall. The results indicate that the proposed method is feasible for the automatic construction of domain entity hierarchies.

The specific embodiments of the invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments. Within the knowledge possessed by a person of ordinary skill in the art, various changes can be made without departing from the spirit of the invention.

Claims

1. A method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning, characterized in that the method comprises the following steps:

Step1: first, obtain candidate hypernym-hyponym relation instances from tourism-domain text by bootstrapping learning;

Step1.1: first, write crawler programs manually and crawl tourism-domain text from tourism websites and encyclopedia entries;

Step1.2: preprocess the corpus with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal and named entity recognition;

Step1.3: select Google's open-source toolkit word2vec and use its Skip-gram model to train a word vector model on the preprocessed corpus;

Step1.4: scan the preprocessed documents, select the sentences that contain two or more domain entities, and choose the feature contexts;

Step1.5: seed set acquisition, in which, after stop words and adjectives are removed from each context, every remaining word is converted into its word vector, these vectors are combined in a simple way to give one feature vector per context, and any relation instance is represented by the combination of its three context vectors;

Step1.6: starting from the hypernym-hyponym seed set obtained in Step1.5, generate hypernym-hyponym extraction patterns by Single-pass clustering;

Step1.7: after the extraction patterns are obtained in Step1.6, acquire candidate relation instances using the new-relation-instance acquisition method;

Step2: using the candidate hypernym-hyponym relation instances and a manually constructed tourism-domain knowledge base, organize the candidate instances into a hierarchy by means of mapping matrices;

Step2.2: by clustering the training data and training the corresponding mappings, judge whether two given entities stand in a hypernym-hyponym relation, and organize the relations into a hierarchy.

2. The method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.2 are:

Step1.2.3: use the Ansj segmentation tool to perform word segmentation, part-of-speech tagging, stop-word removal and named entity recognition on the experimental corpus.

3. The method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.4 are:

Step1.4.2: scan the processed documents, select the sentences that contain two or more domain entities, and take the words before the first entity (BEF), the words between the two entities (BET) and the words after the second entity (AFT) as the feature contexts.

4. The method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.6 are:

Step1.6.1: assign the first instance to a first, newly created cluster (pattern);

Step1.6.2: traverse the list of seed instances and compute the similarity between each seed instance and every existing cluster; if the similarity exceeds a threshold, add the seed instance to that cluster (pattern), otherwise create a new cluster (pattern);

Step1.6.3: to prevent erroneous patterns from being added to the pattern set, screen the patterns by scoring them.

5. The method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.7 are:

Step1.7.1: scan the unannotated documents and obtain all text segments whose semantic types match those of the relation instances in the seed set;

Step1.7.2: for each text segment, generate a relation instance in the same way as in Step1.6;

Step1.7.3: if the similarity between a relation instance and some pattern is greater than or equal to the threshold, the relation instance is regarded as a candidate instance.

6. The method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step2.1 are:

Step2.1.1: write crawler programs manually and crawl tourism-domain text from tourism websites and encyclopedia entries;

Step2.1.2: use the open-source toolkit Ansj for word segmentation, part-of-speech tagging and word frequency counting, and take the words that co-occur frequently with the seeds as the domain term set;

Step2.1.3: on the basis of the classification system of Hudong Baike (Interactive Encyclopedia), construct a tourism-domain knowledge base containing 10000 domain entities.

7. The method for acquiring and organizing domain entity hypernym-hyponym relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step2.2 are:

Step2.2.1: randomly select K cluster centroids from the data set, and cluster the hypernym-hyponym entity pairs (x, y) by their vector offsets y - x using K-means clustering;

Step2.2.2: for each cluster obtained in Step2.2.1, learn a separate mapping Φ_k^* that minimizes the mean squared mapping error over that cluster:

$$\Phi_k^{*} = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y)\in C_k} \lVert \Phi_k x - y \rVert^{2}$$

where Φ_k^* denotes the mapping matrix of the k-th cluster; (x, y) denotes a hypernym-hyponym pair in which x is the hyponym of y and y is the hypernym of x; the term ||Φ_k x - y||² expresses that, for a given entity x and its hypernym y, there is a transition matrix Φ_k such that y = Φ_k x; and N_k is the number of entity pairs in the k-th cluster C_k;

Step2.2.3: once the mapping matrix of every cluster has been obtained in Step2.2.2, judge whether a new word pair forms a hypernym-hyponym relation;

Step2.2.4: resolve conflicts in the hierarchy with heuristic rules; when a cycle appears in the graph, the weakest edge is removed or reversed, and reversing the weakest edge produces an indirect hypernym-hyponym relation.