CN107463607A - Method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning - Google Patents

Publication number
CN107463607A
Authority
CN
China
Prior art keywords
hyponymy
bootstrapping
entity
domain
domain entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710484051.XA
Other languages
Chinese (zh)
Other versions
CN107463607B (en)
Inventor
郭剑毅
马晓军
余正涛
陈玮
张志坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201710484051.XA
Publication of CN107463607A
Application granted
Publication of CN107463607B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The present invention relates to a method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning, and belongs to the technical field of natural language processing and machine learning. The invention first obtains candidate hypernym-hyponym relation instances from tourism-domain text by means of bootstrapping learning; then, using these candidate instances and a manually constructed tourism-domain knowledge base, it organizes the candidate instances into a hierarchy with the aid of mapping matrices. The invention achieves effective extraction of hypernym-hyponym relations and provides strong support for tasks such as information extraction, information retrieval, and machine translation. Compared with current recognition methods, the precision, recall, and F value of the invention are improved, so the invention has research significance.

Description

Method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning
Technical Field
The present invention relates to a method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning, and belongs to the technical field of natural language processing and machine learning.
Background Art
The hypernym-hyponym relation is a basic semantic relation that is commonly used in building and verifying ontologies, knowledge bases, and dictionaries. From the perspective of technical implementation, acquiring hypernym-hyponym relations provides important support for acquiring other information: it can be used to check the correctness of ontologies, knowledge bases, and dictionaries and to expand and improve them. It can also supply semantic information for noun phrases, especially out-of-vocabulary words, and more semantic relations between concepts can be obtained by extension. On the whole, hypernym-hyponym relation acquisition is a basic and key problem in knowledge acquisition and an important step in converting unformatted information into formatted information; it provides fundamental support for further information processing such as database querying, data mining, and text mining. It can also support the implementation of information retrieval, knowledge question answering, personalized information services, and the like.
Summary of the Invention
The invention provides a method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning. It is intended to solve the problems that traditional hypernym-hyponym relation extraction methods depend heavily on the corpus and have relatively low extraction efficiency.
The technical scheme of the invention is a method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning, the specific steps of which are as follows:
Step1, first, candidate hypernym-hyponym relation instances are obtained from tourism-domain text by means of bootstrapping learning;
Step1.1, first, a crawler is written manually to crawl tourism-domain text from tourism websites and encyclopedia entries;
Because web pages differ in structure, the positions and tags to be crawled also differ and no ready-made program is available, so a separate crawler has to be written for each crawling task. Corpora from tourism web pages on different themes are selected as comprehensively as possible, for example Baidu Baike entries and tourism information pages.
Step1.2, corpus preprocessing is completed with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal, and named entity recognition;
The crawled corpus contains noise such as duplicate pages, web page tags, and invalid characters, all of which is useless. This noise is therefore removed by filtering and denoising operations to obtain a high-quality text-level corpus containing only tourism-domain content.
The specific steps of Step1.2 are:
Step1.2.1, the crawled web page text is filtered to remove invalid characters and invalid pages;
Step1.2.2, the resulting valid pages are deduplicated and junk information is removed;
Step1.2.3, the Ansj word segmentation tool is used to perform word segmentation, part-of-speech tagging, stop-word removal, and named entity recognition on the corpus.
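The filtering and deduplication in Step1.2.1 and Step1.2.2 can be sketched as follows. This is a minimal illustration only: the regular expressions, the hash-based duplicate check, and the function names `clean_page` and `deduplicate` are assumptions rather than the patented procedure, and the Ansj segmentation of Step1.2.3 (a Java toolkit) is not shown.

```python
import hashlib
import re

TAG_RE = re.compile(r"<[^>]+>")    # crude HTML tag matcher (assumption)
SPACE_RE = re.compile(r"\s+")      # runs of whitespace


def clean_page(raw_html):
    """Strip tags and collapse whitespace in one crawled page."""
    text = TAG_RE.sub(" ", raw_html)
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(pages):
    """Keep the first copy of each page, comparing cleaned text by hash."""
    seen, unique = set(), []
    for raw in pages:
        text = clean_page(raw)
        if not text:               # drop pages with no usable text
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Exact-hash comparison only catches pages whose cleaned text is identical; near-duplicate detection would need a different technique.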
Step1.3, because word vectors represent words as dense, low-dimensional real-valued vectors that capture lexical, syntactic, and semantic information about words well, Google's open-source toolkit word2vec is selected and a word vector model is trained on the preprocessed corpus with the Skip-gram model;
Training the word vector model is the prerequisite and basis of hypernym-hyponym relation extraction and is an indispensable step. Because Chinese text consists mainly of characters and, compared with English, the semantic relations between characters are complex, word segmentation must be performed before Chinese text is represented as word vectors. After segmentation with the word segmentation tool, manual proofreading is required.
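To make the role of word segmentation concrete, the sketch below shows how the Skip-gram model forms its (center word, context word) training pairs from one segmented sentence. It illustrates the general Skip-gram idea only and is not word2vec itself; the function name `skipgram_pairs` and the window size are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for one segmented sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:             # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```

Because the pairs are built from tokens, a segmentation error changes which words are treated as neighbours, which is one reason the manual proofreading mentioned above matters.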
Step1.4, the preprocessed documents are scanned, sentences containing two or more domain entities are selected, and feature contexts are chosen;
The specific steps of Step1.4 are:
Step1.4.1, the text is split into sentences and entities are annotated manually;
Step1.4.2, finally, the processed documents are scanned and sentences containing two or more domain entities are selected; the words before the first entity (BEF), the words between the two entities (BET), and the words after the second entity (AFT) are taken as the feature context.
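The BEF/BET/AFT split of Step1.4.2 can be sketched as below. The source does not state how many words are kept before and after the entities, so the `window` parameter, the function name `split_context`, and the choice of the first two entities in the sentence are assumptions made for illustration.

```python
def split_context(tokens, entity_positions, window=3):
    """Split a tokenized sentence around its first two entities.

    tokens: list of words; entity_positions: sorted token indices of entities.
    Returns (BEF, BET, AFT), or None when fewer than two entities are present.
    """
    if len(entity_positions) < 2:
        return None
    first, second = entity_positions[0], entity_positions[1]
    bef = tokens[max(0, first - window):first]      # words before entity 1
    bet = tokens[first + 1:second]                  # words between the entities
    aft = tokens[second + 1:second + 1 + window]    # words after entity 2
    return bef, bet, aft
```

Sentences with fewer than two entities return `None`, which corresponds to the filtering condition in Step1.4.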
Step1.5, acquisition of the seed set: for each context, after stop words and adjectives are removed, every remaining word is converted into its own word vector, and these vectors are simply combined to obtain a feature vector; each relation instance is then represented by the combination of the three resulting vectors;
Acquiring the seed set is likewise a prerequisite and basis of hypernym-hyponym relation extraction and an indispensable step. The seed set is the key to the bootstrapping learning algorithm: only with a high-quality seed set can high-quality hypernym-hyponym relation extraction patterns be extracted.
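A minimal sketch of the representation in Step1.5 follows. The source says only that the word vectors are "simply combined"; element-wise summation is assumed here. The names `context_vector` and `instance_vectors`, and the `drop` set standing in for stop words and adjectives, are hypothetical.

```python
def context_vector(words, embeddings, dim, drop=frozenset()):
    """Sum the embeddings of the words in one context (BEF, BET or AFT)."""
    total = [0.0] * dim
    for w in words:
        if w in drop or w not in embeddings:   # skip removed or unknown words
            continue
        for i, v in enumerate(embeddings[w]):
            total[i] += v
    return total


def instance_vectors(bef, bet, aft, embeddings, dim, drop=frozenset()):
    """Represent a relation instance as three context vectors."""
    return tuple(context_vector(c, embeddings, dim, drop) for c in (bef, bet, aft))
```

An empty context yields a zero vector, so every instance has the same three-vector shape regardless of sentence length.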
Step1.6, with the hypernym-hyponym relation seed set obtained in Step1.5, extraction patterns are generated by Single-pass clustering; the input of the algorithm is the list of seed relation instances and the output is the set of relation patterns.
The specific steps of Step1.6 are:
Step1.6.1, the first instance is assigned to a new, initially empty cluster (pattern);
Step1.6.2, the list of seed instances is traversed and the similarity between each seed instance and each cluster is computed. If the similarity exceeds a given threshold, the seed instance is added to that cluster (pattern); otherwise a new cluster (pattern) is created.
Step1.6.3, to prevent erroneous patterns from being added to the pattern set, patterns are screened by scoring.
The acquisition of extraction patterns in the invention is mainly aimed at obtaining high-quality hypernym-hyponym relation extraction patterns.
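Step1.6.1 and Step1.6.2 can be sketched as a single-pass clustering loop. For brevity each instance is a single vector here, the similarity to a cluster is taken as the cosine similarity to the cluster centroid, and the threshold value is arbitrary; all three choices are assumptions, and the pattern scoring of Step1.6.3 is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def single_pass(instances, threshold=0.7):
    """Assign each instance to its most similar cluster or start a new one."""
    clusters = []
    for vec in instances:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, centroid(cluster))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            best.append(vec)        # join the existing pattern
        else:
            clusters.append([vec])  # open a new pattern
    return clusters
```

Each instance is visited once, so the result depends on the order of the seed list; this is a known property of single-pass clustering.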
Step1.7, after the extraction patterns have been obtained in Step1.6, candidate relation instances are acquired by the new relation instance acquisition method; the input of the algorithm is the set of candidate sentences and the set of relation patterns, and the output is the candidate relation instances.
The specific steps of Step1.7 are:
Step1.7.1, the unannotated documents are scanned to obtain all paragraphs whose semantic types are the same as those of the relation instances in the seed set.
Step1.7.2, for each paragraph, relation instances are generated in the same way as in Step1.6.
Step1.7.3, if the similarity between a relation instance and some pattern is greater than or equal to the threshold, the relation instance is regarded as a candidate instance.
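The threshold test of Step1.7.3 might look like the sketch below. It assumes that the similarity between an instance and a pattern is a weighted sum of the cosine similarities of their BEF, BET, and AFT vectors; the default weights (0.2, 0.6, 0.2) are borrowed from the Conf2 setting reported in the experiments, while the weighted-sum form, the threshold value, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def weighted_similarity(instance, pattern, weights=(0.2, 0.6, 0.2)):
    """instance, pattern: (BEF, BET, AFT) triples of vectors."""
    return sum(w * cosine(a, b) for w, a, b in zip(weights, instance, pattern))


def is_candidate(instance, patterns, threshold=0.6, weights=(0.2, 0.6, 0.2)):
    """True when the instance matches at least one pattern at the threshold."""
    return any(weighted_similarity(instance, p, weights) >= threshold
               for p in patterns)
```

Raising the middle weight makes the decision depend more on the words between the two entities.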
Step2, using the candidate hypernym-hyponym relation instances and a manually constructed tourism-domain knowledge base, the candidate instances are organized into a hierarchy with the aid of mapping matrices;
Step2.1, a domain knowledge base is constructed manually as the training data for the mapping matrices;
The specific steps of Step2.1 are:
Step2.1.1, a crawler is written manually to crawl tourism-domain text from tourism websites and encyclopedia entries;
Step2.1.2, the open-source toolkit Ansj is used for word segmentation, part-of-speech tagging, and word frequency counting, and words that co-occur frequently with the seeds are collected as the domain term set;
Step2.1.3, based on the classification system of Hudong Baike (Interactive Encyclopedia), a tourism-domain knowledge base containing 10000 domain entities is constructed.
Step2.2, by clustering the training data and training the corresponding mappings, it is judged whether a hypernym-hyponym relation exists between two given entities, and the hierarchy is organized accordingly.
The specific steps of Step2.2 are:
Step2.2.1, K cluster centroids are selected at random from the data set, and the hypernym-hyponym entity pairs (x, y) are clustered by K-means according to their vector offsets y-x;
Step2.2.2, for each cluster obtained in Step2.2.1, a mapping is learned separately by minimizing the mean squared error over the cluster:
$$\Phi_k^* = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y) \in C_k} \lVert \Phi_k x - y \rVert^2$$
where Φk* denotes the learned mapping matrix of the k-th cluster; (x, y) denotes a hypernym-hyponym pair in which x is the hyponym of y and y is the hypernym of x; ||Φk x - y||^2 measures how far Φk x is from y, reflecting the assumption that for an entity x and its hypernym y there is a transformation matrix Φk such that y = Φk x; and Nk is the number of entity pairs in the k-th cluster Ck;
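The objective in Step2.2.2 can be minimized in several ways; the source does not say which optimizer is used. The sketch below uses plain batch gradient descent in pure Python as one possible choice. The learning rate, the number of epochs, the identity initialization, and the names `matvec`, `mse`, and `fit_mapping` are all assumptions.

```python
def matvec(phi, x):
    """Multiply matrix phi (list of rows) by vector x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]


def mse(phi, pairs):
    """(1/N) * sum of ||phi x - y||^2 over the (x, y) pairs of one cluster."""
    total = 0.0
    for x, y in pairs:
        total += sum((p - yi) ** 2 for p, yi in zip(matvec(phi, x), y))
    return total / len(pairs)


def fit_mapping(pairs, dim, lr=0.1, epochs=500):
    """Fit one cluster's mapping matrix by batch gradient descent."""
    phi = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    n = len(pairs)
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in pairs:
            residual = [p - yi for p, yi in zip(matvec(phi, x), y)]
            for i in range(dim):
                for j in range(dim):
                    # d/dphi[i][j] of the mean squared error
                    grad[i][j] += 2.0 * residual[i] * x[j] / n
        for i in range(dim):
            for j in range(dim):
                phi[i][j] -= lr * grad[i][j]
    return phi
```

For realistic embedding dimensions a linear-algebra library would be far faster; the loop form is used here only to keep the gradient visible.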
Step2.2.3, once the mapping matrix of every cluster has been obtained in Step2.2.2, it is judged whether a new word pair forms a hypernym-hyponym relation;
Step2.2.4, conflicts in the hierarchy are handled with heuristic rules: when a cycle appears in the graph, the weakest edge is removed or reversed, and reversing the weakest edge forms an indirect hypernym-hyponym relation. This ensures that the final hierarchy satisfies the constraint of being a directed acyclic graph.
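The cycle handling of Step2.2.4 can be illustrated as follows. The sketch covers only the "remove the weakest edge" option, not the reversal option, and it assumes that every edge (hyponym, hypernym) carries a numeric confidence weight; the weight definition and the names `find_cycle` and `break_cycles` are hypothetical.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    state = {node: 0 for node in graph}   # 0 = unvisited, 1 = on path, 2 = done
    path = []

    def visit(node):
        state[node] = 1
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == 1:           # back edge closes a cycle
                nodes = path[path.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if state[nxt] == 0:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in graph:
        if state[node] == 0:
            found = visit(node)
            if found:
                return found
    return None


def break_cycles(weighted_edges):
    """Delete the weakest edge of each cycle until no cycle remains."""
    edges = dict(weighted_edges)          # {(hyponym, hypernym): weight}
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            return edges
        weakest = min(cycle, key=lambda e: edges[e])
        del edges[weakest]
```

The depth-first search is recursive, so very deep hierarchies would need an iterative version.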
The main task here is to judge whether two given words have a hypernym-hyponym relation. After the training data have been clustered and a mapping matrix has been learned for each cluster, it can be judged whether a new word pair forms a hypernym-hyponym relation. Given two words x and y, the cluster Ck nearest to their vector offset y-x is found first and the corresponding mapping matrix Φk is obtained. For y to be a hypernym of x, two conditions must be met:
Condition one, the mapping matrix Φk makes Φk x sufficiently close to y.
Condition two, the hypernym-hyponym relation satisfies transitivity.
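Condition one can be sketched as below: pick the cluster whose centroid is nearest to the offset y-x, apply that cluster's mapping matrix to x, and compare the result with y. The Euclidean distance, the threshold `delta`, and the function names are assumptions, since the source says only "sufficiently close"; the transitivity check of condition two is not implemented here.

```python
import math


def matvec(phi, x):
    """Multiply matrix phi (list of rows) by vector x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_cluster(offset, centroids):
    """Index of the centroid closest to the offset vector."""
    return min(range(len(centroids)), key=lambda k: distance(offset, centroids[k]))


def is_hypernym(x, y, centroids, mappings, delta=0.5):
    """True when mappings[k] applied to x lands within delta of y."""
    offset = [b - a for a, b in zip(x, y)]
    k = nearest_cluster(offset, centroids)
    return distance(matvec(mappings[k], x), y) < delta
```

A smaller `delta` makes the test stricter, trading recall for precision.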
The beneficial effects of the invention are:
1. Compared with existing hypernym-hyponym relation extraction methods, the method of the invention improves the accuracy of hypernym-hyponym relation extraction and achieves better results;
2. Compared with existing hypernym-hyponym relation extraction methods, the method of the invention represents words as word vectors and obtains extraction patterns by bootstrapping learning, and can therefore extract domain-entity hypernym-hyponym relations better;
3. The method of the invention achieves effective extraction of hypernym-hyponym relations and provides strong support for follow-up work such as information extraction, information retrieval, machine translation, and knowledge graph construction.
Brief Description of the Drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is a diagram of the semantic hierarchy of part of the domain knowledge base in the invention;
Fig. 3 is an example of the constructed semantic hierarchy of domain entities.
Detailed Description
Embodiment 1: As shown in Figs. 1-3, a method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning, the specific steps of which are as follows:
Step1, first, candidate hypernym-hyponym relation instances are obtained from tourism-domain text by means of bootstrapping learning;
Step1.1, first, a crawler is written manually to crawl tourism-domain text from tourism websites and encyclopedia entries;
Because web pages differ in structure, the positions and tags to be crawled also differ and no ready-made program is available, so a separate crawler has to be written for each crawling task. Corpora from tourism web pages on different themes are selected as comprehensively as possible, for example Baidu Baike entries and tourism information pages.
Step1.2, corpus preprocessing is completed with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal, and named entity recognition;
The crawled corpus contains noise such as duplicate pages, web page tags, and invalid characters, all of which is useless. This noise is therefore removed by filtering and denoising operations to obtain a high-quality text-level corpus containing only tourism-domain content.
As a further scheme of the invention, the specific steps of Step1.2 are:
Step1.2.1, the crawled web page text is filtered to remove invalid characters and invalid pages;
Step1.2.2, the resulting valid pages are deduplicated and junk information is removed;
Step1.2.3, the Ansj word segmentation tool is used to perform word segmentation, part-of-speech tagging, stop-word removal, and named entity recognition on the corpus.
Step1.3, because word vectors represent words as dense, low-dimensional real-valued vectors that capture lexical, syntactic, and semantic information about words well, Google's open-source toolkit word2vec is selected and a word vector model is trained on the preprocessed corpus with the Skip-gram model;
Training the word vector model is the prerequisite and basis of hypernym-hyponym relation extraction and is an indispensable step. Because Chinese text consists mainly of characters and, compared with English, the semantic relations between characters are complex, word segmentation must be performed before Chinese text is represented as word vectors. After segmentation with the word segmentation tool, manual proofreading is required.
Step1.4, the preprocessed documents are scanned, sentences containing two or more domain entities are selected, and feature contexts are chosen;
As a further scheme of the invention, the specific steps of Step1.4 are:
Step1.4.1, the text is split into sentences and entities are annotated manually;
Step1.4.2, finally, the processed documents are scanned and sentences containing two or more domain entities are selected; the words before the first entity (BEF), the words between the two entities (BET), and the words after the second entity (AFT) are taken as the feature context.
Step1.5, acquisition of the seed set: for each context, after stop words and adjectives are removed, every remaining word is converted into its own word vector, and these vectors are simply combined to obtain a feature vector; each relation instance is then represented by the combination of the three resulting vectors;
Acquiring the seed set is likewise a prerequisite and basis of hypernym-hyponym relation extraction and an indispensable step. The seed set is the key to the bootstrapping learning algorithm: only with a high-quality seed set can high-quality hypernym-hyponym relation extraction patterns be extracted.
Step1.6, with the hypernym-hyponym relation seed set obtained in Step1.5, extraction patterns are generated by Single-pass clustering; the input of the algorithm is the list of seed relation instances and the output is the set of relation patterns.
As a further scheme of the invention, the specific steps of Step1.6 are:
Step1.6.1, the first instance is assigned to a new, initially empty cluster (pattern);
Step1.6.2, the list of seed instances is traversed and the similarity between each seed instance and each cluster is computed. If the similarity exceeds a given threshold, the seed instance is added to that cluster (pattern); otherwise a new cluster (pattern) is created.
Step1.6.3, to prevent erroneous patterns from being added to the pattern set, patterns are screened by scoring.
The acquisition of extraction patterns in the invention is mainly aimed at obtaining high-quality hypernym-hyponym relation extraction patterns.
Step1.7, after the extraction patterns have been obtained in Step1.6, candidate relation instances are acquired by the new relation instance acquisition method; the input of the algorithm is the set of candidate sentences and the set of relation patterns, and the output is the candidate relation instances.
As a further scheme of the invention, the specific steps of Step1.7 are:
Step1.7.1, the unannotated documents are scanned to obtain all paragraphs whose semantic types are the same as those of the relation instances in the seed set.
Step1.7.2, for each paragraph, relation instances are generated in the same way as in Step1.6.
Step1.7.3, if the similarity between a relation instance and some pattern is greater than or equal to the threshold, the relation instance is regarded as a candidate instance.
Step2, using the candidate hypernym-hyponym relation instances and a manually constructed tourism-domain knowledge base, the candidate instances are organized into a hierarchy with the aid of mapping matrices;
Step2.1, a domain knowledge base is constructed manually as the training data for the mapping matrices;
As a further scheme of the invention, the specific steps of Step2.1 are:
Step2.1.1, a crawler is written manually to crawl tourism-domain text from tourism websites and encyclopedia entries;
Step2.1.2, the open-source toolkit Ansj is used for word segmentation, part-of-speech tagging, and word frequency counting, and words that co-occur frequently with the seeds are collected as the domain term set;
Step2.1.3, based on the classification system of Hudong Baike (Interactive Encyclopedia), a tourism-domain knowledge base containing 10000 domain entities is constructed.
To learn the mapping matrices, a small-scale domain knowledge base is constructed manually as training data. On the basis of an in-depth analysis of domain attributes and industry attributes, a domain knowledge system is defined manually, a seed set of domain-related concepts is collected, and, with the aid of online encyclopedia resources, a small-scale domain knowledge base is built: a tourism-domain knowledge base containing 10000 domain entities. Part of the semantic hierarchy of the tourism-domain knowledge base is shown in Fig. 2;
Step2.2, by clustering the training data and training the corresponding mappings, it is judged whether a hypernym-hyponym relation exists between two given entities, and the hierarchy is organized accordingly.
By clustering the training data and training the corresponding mappings, it can be judged whether a hypernym-hyponym relation exists between two given entities. Fig. 3 shows a hierarchy constructed for domain-entity hypernym-hyponym relations using the trained mapping matrices.
As a further scheme of the invention, the specific steps of Step2.2 are:
Step2.2.1, K cluster centroids are selected at random from the data set, and the hypernym-hyponym entity pairs (x, y) are clustered by K-means according to their vector offsets y-x;
Step2.2.2, for each cluster obtained in Step2.2.1, a mapping is learned separately by minimizing the mean squared error over the cluster:
$$\Phi_k^* = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y) \in C_k} \lVert \Phi_k x - y \rVert^2$$
where Φk* denotes the learned mapping matrix of the k-th cluster; (x, y) denotes a hypernym-hyponym pair in which x is the hyponym of y and y is the hypernym of x; ||Φk x - y||^2 measures how far Φk x is from y, reflecting the assumption that for an entity x and its hypernym y there is a transformation matrix Φk such that y = Φk x; and Nk is the number of entity pairs in the k-th cluster Ck;
Step2.2.3, once the mapping matrix of every cluster has been obtained in Step2.2.2, it is judged whether a new word pair forms a hypernym-hyponym relation;
Step2.2.4, conflicts in the hierarchy are handled with heuristic rules: when a cycle appears in the graph, the weakest edge is removed or reversed, and reversing the weakest edge forms an indirect hypernym-hyponym relation. This ensures that the final hierarchy satisfies the constraint of being a directed acyclic graph.
This embodiment constructs a tourism-domain knowledge base of 10000 domain entities, which provides corpus support for learning the mapping matrices of this patent;
To verify the recognition performance of the invention, unified evaluation criteria are used: precision, recall, and F value serve as the evaluation criteria for measuring the performance of the invention.
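The three evaluation measures can be computed as in the sketch below. The source does not spell out the F formula; the balanced harmonic mean of precision and recall is assumed here, and it reproduces the Conf1 row reported in Table 1 (P = 85.8, R = 70.2, F = 77.2). The function names are hypothetical.

```python
def f_measure(p, r):
    """Balanced F value (harmonic mean of precision and recall)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def precision_recall_f(predicted, gold):
    """Score a set of predicted (hyponym, hypernym) pairs against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r, f_measure(p, r)
```

Because pairs are compared as ordered tuples, a pair with hyponym and hypernym swapped counts as an error.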
To verify the validity of the invention, the following groups of experiments were designed:
Experiment 1: To verify the influence of the three kinds of features on domain-entity hypernym-hyponym relation extraction performance, two different sets of weight parameters are chosen, shown as Conf1 and Conf2:
Conf1: α=0.1, β=0.8, γ=0.1
Conf2: α=0.2, β=0.6, γ=0.2
Here Conf1 concentrates the weight on the context between the two entities (BET), while Conf2 takes the context information of all three feature classes into account. The precision, recall, and F values are averages over the TOP5 patterns. The experimental results are shown in Table 1.
Table 1: Effect of different features on domain-entity hypernym-hyponym relation extraction performance
Parameter   P (%)   R (%)   F (%)
Conf1       85.8    70.2    77.2
Conf2       79.4    63.5    70.6
The experimental data in Table 1 show that, for most types of hypernym-hyponym relation patterns, the recall achieved with the Conf2 parameter setting is lower than with Conf1. Analysis of the results indicates that the main reason is that the BEF and AFT context data are too sparse and contain many words that contribute nothing to the relation between the entity pair. The results show that the context words between the two entities play the more important role in identifying the hypernym-hyponym relation of an entity pair.
Experiment 2: To verify the feasibility of the proposed method, a comparative test is carried out on the same experimental data set. The TOP5 pattern clustering results are selected for testing. The experimental results are shown in Table 2.
Table 2: Comparison of different hypernym-hyponym relation extraction methods
Table 2 shows that, for the TOP5 pattern clustering results, the proposed method achieves better F values than the Snowball algorithm, and for some relation patterns its F value is as much as 20% higher than Snowball's. The results show that the word vector model can represent the semantic features of language and improve the extraction of entity hypernym-hyponym relations.
Experiment 3: To verify the influence of the domain knowledge base on hierarchy construction, the experiment is run both with and without the domain knowledge base. The experimental results are shown in Table 3.
Table 3: Influence of the domain knowledge base on organizing domain-entity hypernym-hyponym relations
Method                         P (%)   R (%)   F (%)
Word vector                    75.3    67.5    69.3
Word vector + knowledge base   78.3    79.8    79.0
Table 3 shows that recall improves greatly when the domain knowledge base is added as a constraint, which demonstrates that the domain knowledge base plays an important role in organizing the hypernym-hyponym relations of domain entities.
Experiment 4: To verify the feasibility of the proposed method, it is compared with a rule-based method and a CRF-based method. The experimental results are shown in Table 4.
Table 4: Domain-entity hypernym-hyponym relation recognition results
Method              P (%)   R (%)   F (%)
Rule-based method   84.4    48.9    61.9
CRF-based method    75.1    72.4    73.7
Proposed method     78.2    79.8    79.0
Table 4 shows that, compared with the rule-based method, the proposed word-vector-based method is slightly lower in precision but far higher in recall. Compared with the method based on cascaded conditional random fields, the proposed method improves both precision and recall. The results demonstrate the feasibility of the proposed method for the task of automatically constructing a domain-entity hierarchy.
The embodiments of the invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to the above embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the invention.

Claims (7)

1. A method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning, characterized in that the specific steps of the method are as follows:
Step1, first, candidate hypernym-hyponym relation instances are obtained from tourism-domain text by means of bootstrapping learning;
Step1.1, first, a crawler is written manually to crawl tourism-domain text from tourism websites and encyclopedia entries;
Step1.2, corpus preprocessing is completed with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal, and named entity recognition;
Step1.3, Google's open-source toolkit word2vec is selected, and a word vector model is trained on the preprocessed corpus with the Skip-gram model;
Step1.4, the preprocessed documents are scanned, sentences containing two or more domain entities are selected, and feature contexts are chosen;
Step1.5, acquisition of the seed set: for each context, after stop words and adjectives are removed, every remaining word is converted into its own word vector, and these vectors are simply combined to obtain a feature vector; each relation instance is then represented by the combination of the three resulting vectors;
Step1.6, with the hypernym-hyponym relation seed set obtained in Step1.5, hypernym-hyponym relation extraction patterns are generated by Single-pass clustering;
Step1.7, after the extraction patterns have been obtained in Step1.6, candidate relation instances are acquired by the new relation instance acquisition method;
Step2, using the candidate hypernym-hyponym relation instances and a manually constructed tourism-domain knowledge base, the candidate instances are organized into a hierarchy with the aid of mapping matrices;
Step2.1, a domain knowledge base is constructed manually as the training data for the mapping matrices;
Step2.2, by clustering the training data and training the corresponding mappings, it is judged whether a hypernym-hyponym relation exists between two given entities, and the hierarchy is organized accordingly.
2. the domain entities hyponymy of bluebeard compound vector sum bootstrapping study according to claim 1 obtains and organizer Method, it is characterised in that:The step Step1.2's concretely comprises the following steps:
Step1.2.1, the web page text information crawled is effectively filtered, remove idle character and webpage;
Step1.2.2, duplicate removal is carried out to obtained effective web, goes junk information pretreatment operation;
Step1.2.3, using Ansj participle instrument operative function is segmented, part-of-speech tagging, go stop words and name entity The process of identification.
3. The method for acquiring and organizing domain entity hyponymy relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that step Step1.4 specifically comprises:
Step1.4.1, split the text into sentences and annotate the entities manually;
Step1.4.2, finally, scan the processed documents, select the sentences that contain two or more domain entities, and take the words before the first entity (BEF), the words between the two entities (BET) and the words after the second entity (AFT) as the feature context.
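A minimal Python sketch of the context selection in Step1.4.2 follows. The function name, the fixed window size, the treatment of each entity as a single token, and the pairing of adjacent entities only are assumptions for illustration; the claim does not specify them.

```python
# Hypothetical sketch of Step1.4.2; window size and pairing rule are assumed.

def extract_contexts(tokens, entities, window=2):
    """Return BEF/BET/AFT contexts for each pair of adjacent entities.

    tokens is one segmented sentence; entities is the set of annotated
    domain entities. Sentences with fewer than two entities yield [].
    """
    positions = [i for i, tok in enumerate(tokens) if tok in entities]
    contexts = []
    for a, b in zip(positions, positions[1:]):
        contexts.append({
            "pair": (tokens[a], tokens[b]),
            "BEF": tokens[max(0, a - window):a],   # words before the first entity
            "BET": tokens[a + 1:b],                # words between the two entities
            "AFT": tokens[b + 1:b + 1 + window],   # words after the second entity
        })
    return contexts


sentence = ["StoneForest", "is", "a", "famous", "ScenicSpot", "near", "Kunming", "city"]
contexts = extract_contexts(sentence, {"StoneForest", "ScenicSpot"})
```

For the sample sentence the single extracted pair has an empty BEF because the first entity opens the sentence.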
4. The method for acquiring and organizing domain entity hyponymy relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that step Step1.6 specifically comprises:
Step1.6.1, assign the first instance to a new, initially empty cluster pattern;
Step1.6.2, traverse the seed instance list and compute the similarity between each seed instance and each cluster; if the similarity exceeds a threshold, add the seed instance to that cluster pattern, otherwise create a new cluster pattern;
Step1.6.3, to prevent erroneous patterns from being added to the pattern set, screen the patterns by scoring.
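Steps Step1.6.1 and Step1.6.2 can be sketched in Python as below. This is a simplified illustration under stated assumptions: each instance is flattened to one vector, similarity is the cosine between the instance and the cluster centroid, and an instance joins the most similar cluster when that similarity exceeds the threshold. The scoring-based screening of Step1.6.3 is not shown, and all names are hypothetical.

```python
# Hypothetical sketch of single-pass clustering (Step1.6.1 and Step1.6.2).
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def single_pass(instances, threshold):
    """Cluster instances in one pass; each cluster acts as one pattern."""
    clusters = []
    for inst in instances:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(inst, centroid(cluster))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            best.append(inst)          # similar enough: join the existing pattern
        else:
            clusters.append([inst])    # otherwise start a new pattern
    return clusters


patterns = single_pass([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.8)
```

Because the first instance finds no existing cluster, it always opens the first pattern, which matches Step1.6.1. The result depends on the order of the instances, as is usual for single-pass clustering.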
5. The method for acquiring and organizing domain entity hyponymy relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that step Step1.7 specifically comprises:
Step1.7.1, scan the unlabeled documents and obtain all passages whose semantic types match those of the relation instances in the seed set;
Step1.7.2, for each passage, generate relation instances in the same way as in step Step1.6;
Step1.7.3, if the similarity between a relation instance and some pattern is greater than or equal to the threshold, regard that relation instance as a candidate instance.
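The acceptance test of Step1.7.3 can be sketched as follows. The claim only requires that the similarity to some pattern be greater than or equal to a threshold; the weighted sum of cosine similarities over the three context vectors, the weight values, and the function names are assumptions chosen for illustration.

```python
# Hypothetical sketch of Step1.7.3; the similarity measure is an assumption.
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def instance_similarity(inst, pattern, weights=(0.2, 0.6, 0.2)):
    """Weighted sum of the cosines of the BEF, BET and AFT vectors."""
    return sum(w * cosine(a, b) for w, a, b in zip(weights, inst, pattern))


def select_candidates(instances, patterns, threshold):
    """Keep the instances whose similarity to any pattern reaches the threshold."""
    return [
        inst for inst in instances
        if any(instance_similarity(inst, p) >= threshold for p in patterns)
    ]


pattern = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
matching = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
unrelated = ([0.0, 1.0], [1.0, 0.0], [1.0, 1.0])
candidates = select_candidates([matching, unrelated], [pattern], threshold=0.9)
```

In a bootstrapping loop the accepted candidates would be fed back into the seed set for the next round of pattern generation.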
6. The method for acquiring and organizing domain entity hyponymy relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that step Step2.1 specifically comprises:
Step2.1.1, write a crawler program by hand and crawl tourism-domain text from tourism websites and encyclopedia entries;
Step2.1.2, use the open-source toolkit Ansj to perform word segmentation, part-of-speech tagging and word frequency counting, and take the words that co-occur most often with the seeds as the domain term set;
Step2.1.3, on the basis of the classification system of the Interactive Encyclopedia, construct a tourism-domain knowledge base containing 10000 domain entities.
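The co-occurrence selection in Step2.1.2 might look like the Python sketch below. Counting co-occurrence at sentence level and the `min_count` cut-off are assumptions; the claim only says that words with high co-occurrence with the seeds are kept. Segmentation is assumed to have been done already, for example by Ansj.

```python
# Hypothetical sketch of the domain term selection in Step2.1.2.
from collections import Counter


def domain_terms(sentences, seeds, min_count=2):
    """Return words that share a sentence with a seed at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        words = set(tokens)
        if words & seeds:                  # the sentence mentions some seed
            counts.update(words - seeds)   # count every other word once
    return {word for word, count in counts.items() if count >= min_count}


corpus = [
    ["Kunming", "scenic", "spot"],
    ["Kunming", "scenic", "lake"],
    ["weather", "report"],
]
terms = domain_terms(corpus, seeds={"Kunming"})
```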
7. The method for acquiring and organizing domain entity hyponymy relations by combining word vectors and bootstrapping learning according to claim 1, characterized in that step Step2.2 specifically comprises:
Step2.2.1, randomly select K cluster centroids from the data set and cluster the hyponymy entity pairs (x, y) by their vector offsets y - x using K-means clustering;
Step2.2.2, for each cluster obtained in Step2.2.1, learn a separate mapping matrix $\Phi_k^*$ that minimizes the mean squared mapping error over that cluster:
$$\Phi_k^* = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y) \in C_k} \lVert \Phi_k x - y \rVert^2$$
where $\Phi_k^*$ denotes the learned mapping matrix and (x, y) denotes a hyponymy pair; the term $\lVert \Phi_k x - y \rVert^2$ expresses that, for a given entity x and its hypernym y, there exists a matrix $\Phi_k$ such that $y = \Phi_k x$, where x is a hyponym of y, y is a hypernym of x, and $\Phi_k$ is the transition matrix; $N_k$ is the number of entity pairs in the k-th cluster $C_k$;
Step2.2.3, after the mapping matrix of each cluster has been obtained in Step2.2.2, judge whether a new word pair forms a hyponymy relation;
Step2.2.4, resolve conflicts in the hierarchy with heuristic rules: when a cycle appears in the graph, remove or reverse the weakest edge; reversing the weakest edge forms an indirect hyponymy relation.
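The mapping step of claim 7 (Step2.2.2 and Step2.2.3) can be illustrated with a small pure-Python sketch. It assumes the K-means clustering of Step2.2.1 has already produced the clusters, solves the least-squares problem through its normal equations, and assumes the training vectors span the space so that the system is non-singular. The distance threshold `delta` and all function names are hypothetical; the claim does not state how the judgement is made. The constant factor 1/N_k does not change the minimizer, so it is omitted.

```python
# Hypothetical sketch of Step2.2.2 and Step2.2.3; not the patent's implementation.
import math


def learn_mapping(pairs):
    """Least-squares matrix Phi minimizing the sum of ||Phi x - y||^2.

    Solves A Phi^T = B^T with A = sum(x x^T) and B = sum(y x^T) by
    Gauss-Jordan elimination with partial pivoting. A must be invertible.
    """
    d = len(pairs[0][0])
    a = [[sum(x[i] * x[j] for x, _ in pairs) for j in range(d)] for i in range(d)]
    bt = [[sum(x[i] * y[j] for x, y in pairs) for j in range(d)] for i in range(d)]
    aug = [a[i] + bt[i] for i in range(d)]          # augmented matrix [A | B^T]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(d):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    phi_t = [row[d:] for row in aug]                # right half now holds Phi^T
    return [[phi_t[j][i] for j in range(d)] for i in range(d)]


def apply_mapping(phi, x):
    """Matrix-vector product Phi x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]


def is_hyponym_pair(x, y, mappings, delta):
    """True if some cluster's mapping sends x to within delta of y."""
    for phi in mappings:
        mapped = apply_mapping(phi, x)
        if math.sqrt(sum((m - t) ** 2 for m, t in zip(mapped, y))) < delta:
            return True
    return False


# One toy cluster whose pairs satisfy y = [[2, 0], [1, 1]] x exactly.
cluster = [([1.0, 0.0], [2.0, 1.0]), ([0.0, 1.0], [0.0, 1.0]), ([1.0, 1.0], [2.0, 2.0])]
phi = learn_mapping(cluster)
```

With real word vectors of a few hundred dimensions a numerical library would normally replace the hand-written solver; the sketch only shows the shape of the computation.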
CN201710484051.XA 2017-06-23 2017-06-23 Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning Active CN107463607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710484051.XA CN107463607B (en) 2017-06-23 2017-06-23 Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning

Publications (2)

Publication Number Publication Date
CN107463607A true CN107463607A (en) 2017-12-12
CN107463607B CN107463607B (en) 2020-07-31

Family

ID=60546337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710484051.XA Active CN107463607B (en) 2017-06-23 2017-06-23 Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning

Country Status (1)

Country Link
CN (1) CN107463607B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015072899A1 (en) * 2013-11-15 2015-05-21 Telefonaktiebolaget L M Ericsson (Publ) Methods and devices for bootstrapping of resource constrained devices
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN106844413A (en) * 2016-11-11 2017-06-13 南京缘长信息科技有限公司 The method and device of entity relation extraction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
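Wang Pei et al.: "A domain-specific entity disambiguation method combining word vectors and a graph model", CAAI Transactions on Intelligent Systems (智能系统学报) *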

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059310A (en) * 2018-01-19 2019-07-26 腾讯科技(深圳)有限公司 Extending method and device, electronic equipment, the storage medium of hypernym network
CN110059310B (en) * 2018-01-19 2022-10-28 腾讯科技(深圳)有限公司 Hypernym network expansion method and device, electronic equipment and storage medium
CN108280221B (en) * 2018-02-08 2022-04-15 北京百度网讯科技有限公司 Method and device for hierarchically constructing focus points and computer equipment
CN108280221A (en) * 2018-02-08 2018-07-13 北京百度网讯科技有限公司 Stratification construction method, device and the computer equipment of focus
CN108763192A (en) * 2018-04-18 2018-11-06 达而观信息科技(上海)有限公司 Entity relation extraction method and device for text-processing
CN108763192B (en) * 2018-04-18 2022-04-19 达而观信息科技(上海)有限公司 Entity relation extraction method and device for text processing
CN108874878A (en) * 2018-05-03 2018-11-23 众安信息技术服务有限公司 A kind of building system and method for knowledge mapping
CN108897857B (en) * 2018-06-28 2021-08-27 东华大学 Chinese text subject sentence generating method facing field
CN108897857A (en) * 2018-06-28 2018-11-27 东华大学 The Chinese Text Topic sentence generating method of domain-oriented
CN109086328A (en) * 2018-06-29 2018-12-25 北京百度网讯科技有限公司 A kind of determination method, apparatus, server and the storage medium of hyponymy
CN108959258B (en) * 2018-07-02 2021-06-18 昆明理工大学 Specific field integrated entity linking method based on representation learning
CN108959258A (en) * 2018-07-02 2018-12-07 昆明理工大学 It is a kind of that entity link method is integrated based on the specific area for indicating to learn
CN110209832A (en) * 2018-08-08 2019-09-06 腾讯科技(北京)有限公司 Method of discrimination, system and the computer equipment of hyponymy
CN109408642B (en) * 2018-08-30 2021-07-16 昆明理工大学 Domain entity attribute relation extraction method based on distance supervision
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 A kind of domain entities relation on attributes abstracting method based on distance supervision
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning
CN109492098B (en) * 2018-10-24 2022-05-06 北京工业大学 Target language material library construction method based on active learning and semantic density
CN109492098A (en) * 2018-10-24 2019-03-19 北京工业大学 Target corpus base construction method based on Active Learning and semantic density
CN109446530A (en) * 2018-11-03 2019-03-08 上海犀语科技有限公司 It is a kind of based on LSTM model by the method and device of Extracting Information in text
CN109522418A (en) * 2018-11-08 2019-03-26 杭州费尔斯通科技有限公司 A kind of automanual knowledge mapping construction method
CN109740149A (en) * 2018-12-11 2019-05-10 英大传媒投资集团有限公司 A kind of synonym extracting method based on remote supervisory
CN109740149B (en) * 2018-12-11 2019-12-13 英大传媒投资集团有限公司 remote supervision-based synonym extraction method
CN112528045A (en) * 2020-12-23 2021-03-19 中译语通科技股份有限公司 Method and system for judging domain map relation based on open encyclopedia map
CN112528045B (en) * 2020-12-23 2024-04-02 中译语通科技股份有限公司 Method and system for judging domain map relation based on open encyclopedia map
CN114003734A (en) * 2021-11-22 2022-02-01 四川大学华西医院 Breast cancer risk factor knowledge system model, knowledge map system and construction method
CN114003734B (en) * 2021-11-22 2023-06-30 四川大学华西医院 Knowledge system and knowledge map system of breast cancer risk factors and construction method

Also Published As

Publication number Publication date
CN107463607B (en) 2020-07-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yu Zhengtao, Ma Xiaojun, Guo Jianyi, Chen Wei, Zhang Zhikun
Inventor before: Guo Jianyi, Ma Xiaojun, Yu Zhengtao, Chen Wei, Zhang Zhikun

GR01 Patent grant