CN107463607A - Method for acquiring and organizing domain-entity hyponymy relations by combining word vectors and bootstrapping learning - Google Patents
Method for acquiring and organizing domain-entity hyponymy relations by combining word vectors and bootstrapping learning
- Publication number
- CN107463607A (application number CN201710484051.XA)
- Authority
- CN
- China
- Prior art keywords
- hyponymy
- bootstrapping
- entity
- domain
- domain entities
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The present invention relates to a method for acquiring and organizing hyponymy (hypernym-hyponym) relations between domain entities by combining word vectors with bootstrapping learning, and belongs to the technical field of natural language processing and machine learning. The invention first obtains candidate hyponymy relation instances from tourism-domain text by bootstrapping learning; then, using the candidate instances and a manually constructed tourism-domain knowledge base, it applies mapping matrices to organize the candidate instances into a hierarchy. The invention extracts hyponymy relations effectively and provides strong support for tasks such as information extraction, information retrieval and machine translation. Compared with existing recognition methods, it improves precision, recall and F value, and is therefore of research significance.
Description
Technical field
The present invention relates to a method for acquiring and organizing domain-entity hyponymy relations by combining word vectors and bootstrapping learning, and belongs to the technical field of natural language processing and machine learning.
Background art
Hyponymy (the hypernym-hyponym relation) is a basic semantic relation, commonly used in the construction and verification of ontologies, knowledge bases and dictionaries. From a technical point of view, hyponymy acquisition provides important support for acquiring other information: it can be used to check the correctness of ontologies, knowledge bases and dictionaries and to extend and improve them. It can also yield semantic information about noun phrases, especially out-of-vocabulary words, and further semantic relations between concepts can be obtained by extension. Overall, hyponymy acquisition is a basic and key problem in knowledge acquisition and an important step in converting unformatted information into formatted information; it provides fundamental support for further information processing such as database querying, data mining and text mining. Hyponymy acquisition also supports information retrieval, knowledge question answering and personalized information services.
Summary of the invention
The invention provides a method for acquiring and organizing domain-entity hyponymy relations by combining word vectors and bootstrapping learning, to address the problems that traditional hyponymy extraction methods depend heavily on the corpus and have relatively low extraction efficiency.
The technical solution of the invention is a method for acquiring and organizing domain-entity hyponymy relations by combining word vectors and bootstrapping learning, with the following specific steps:
Step1, first, candidate hyponymy relation instances are obtained from tourism-domain text by bootstrapping learning;
Step1.1, first, a crawler program is written by hand to crawl tourism-domain text from tourism websites and encyclopedia entries;
Because web pages differ in structure, the positions and tags to be crawled also differ from site to site and no ready-made program exists, so a separate crawler is written for each crawling task. The corpus is chosen to cover as wide a range of tourism web-page topics as possible, for example Baidu Baike entries and tourism information pages.
Step1.2, the corpus is preprocessed with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal and named entity recognition;
The crawled corpus contains noise such as duplicate pages, web-page tags and invalid characters, none of which is useful. This noise is therefore removed by filtering and de-noising to obtain a high-quality text corpus that contains only tourism-domain content.
The specific steps of Step1.2 are:
Step1.2.1, the crawled web-page text is filtered to remove invalid characters and invalid web pages;
Step1.2.2, the remaining valid pages are deduplicated and junk information is removed;
Step1.2.3, the Ansj segmentation tool is used to segment the corpus, tag parts of speech, remove stop words and recognize named entities.
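The filtering and deduplication of Step1.2.1 and Step1.2.2 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's crawler or its Ansj pipeline; the helper names `strip_html` and `dedupe_pages`, and the use of a SHA-256 digest of the whitespace-normalized text as the duplicate key, are assumptions made for the example.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(page):
    """Remove tags and collapse whitespace (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def dedupe_pages(pages):
    """Keep the first copy of each page; drop empty and duplicate texts."""
    seen = set()
    unique = []
    for page in pages:
        text = strip_html(page)
        if not text:
            continue  # nothing left after cleaning: treat as an invalid page
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Segmentation, part-of-speech tagging and named entity recognition (Step1.2.3) would then run on the cleaned texts; they are not reproduced here because they depend on the Ansj toolkit.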
Step1.3, because word vectors represent words as dense, low-dimensional real-valued vectors that capture morphological, syntactic and semantic information about words well, Google's open-source toolkit word2vec is chosen, and its Skip-gram model is trained on the preprocessed corpus to obtain the word vector model;
Training the word vector model is the premise and basis of hyponymy extraction and is an indispensable step. Because Chinese text is composed mainly of characters and, compared with English, the semantic relations between characters are complex, Chinese text must be segmented into words before it can be represented as word vectors. After segmentation with the segmentation tool, the result needs to be proofread manually.
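To make the role of word segmentation concrete, the sketch below shows how a Skip-gram model turns one segmented sentence into (center word, context word) training pairs. It does not train any vectors and is not the word2vec tool itself; the function name `skipgram_pairs` and the `window` parameter are assumptions made for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```

In practice the segmented corpus would be handed to a word2vec implementation configured for Skip-gram (for example, gensim's `Word2Vec` class with `sg=1`), which builds such pairs internally.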
Step1.4, the preprocessed documents are scanned, sentences that contain two or more domain entities are selected, and feature contexts are chosen;
The specific steps of Step1.4 are:
Step1.4.1, the text is split into sentences and entities are annotated manually;
Step1.4.2, the processed documents are then scanned and sentences containing two or more domain entities are selected; the words before the first entity (BEF), the words between the two entities (BET) and the words after the second entity (AFT) are taken as the feature contexts.
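The BEF/BET/AFT split of Step1.4.2 can be illustrated with a short sketch. It assumes the sentence is already segmented and that entity mentions are single tokens found in a known entity set; the function name `extract_contexts`, the `window` limit on BEF and AFT, and the decision to use only the first two entities in a sentence are simplifications made for the example.

```python
def extract_contexts(tokens, entities, window=3):
    """Split a sentence around its first two entity mentions.

    tokens   -- segmented sentence (list of words)
    entities -- set of known domain entity strings
    window   -- how many words to keep before/after (an assumed limit)
    Returns (e1, e2, BEF, BET, AFT), or None if fewer than two entities occur.
    """
    positions = [i for i, tok in enumerate(tokens) if tok in entities]
    if len(positions) < 2:
        return None
    first, second = positions[0], positions[1]
    bef = tokens[max(0, first - window):first]        # words before entity 1
    bet = tokens[first + 1:second]                    # words between the entities
    aft = tokens[second + 1:second + 1 + window]      # words after entity 2
    return tokens[first], tokens[second], bef, bet, aft
```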
Step1.5, seed set acquisition: for each context, after stop words and adjectives are removed, every remaining word is converted into its word vector, and these vectors are simply combined into one feature vector; each relation instance is then represented by the combination of the three resulting vectors;
Acquiring the seed set is likewise a premise and basis of hyponymy extraction and an indispensable step. It is the key to the bootstrapping algorithm: only a high-quality seed set makes it possible to extract high-quality hyponymy extraction patterns.
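A relation instance as described in Step1.5 can be sketched as a triple of context vectors. The patent only says the word vectors are "simply combined"; element-wise summation is assumed here. Adjective removal needs part-of-speech tags, so in this sketch it is folded into the `stop_words` set; `embeddings` is a hypothetical word-to-vector dictionary, and all names are illustrative.

```python
def combine(words, embeddings, stop_words, dim):
    """Sum the vectors of the words that survive filtering."""
    total = [0.0] * dim
    for word in words:
        if word in stop_words or word not in embeddings:
            continue  # filtered out, or no vector available
        total = [t + v for t, v in zip(total, embeddings[word])]
    return total


def instance_vectors(bef, bet, aft, embeddings, stop_words, dim):
    """Represent one relation instance as its (BEF, BET, AFT) vectors."""
    return tuple(combine(ctx, embeddings, stop_words, dim) for ctx in (bef, bet, aft))
```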
Step1.6, using the hyponymy seed set obtained in Step1.5, hyponymy extraction patterns are generated by Single-pass clustering; the input of the algorithm is the list of seed relation instances and the output is the set of relation patterns.
The specific steps of Step1.6 are:
Step1.6.1, the first instance is assigned to a new, initially empty cluster (pattern);
Step1.6.2, the seed instance list is traversed and the similarity between each seed instance and every existing cluster is computed. If the similarity exceeds a given threshold, the seed instance is added to that cluster (pattern); otherwise a new cluster (pattern) is created.
Step1.6.3, to prevent erroneous patterns from being added to the pattern set, the patterns are screened by scoring.
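Steps Step1.6.1 and Step1.6.2 can be sketched as a single-pass clustering loop. The patent does not give the similarity function in this section; the sketch assumes a weighted sum of cosine similarities over the BEF, BET and AFT vectors, compares each instance against the cluster centroid, and uses an arbitrary threshold of 0.7. The default weights reuse the Conf2 values from the experiments, and the pattern scoring of Step1.6.3 is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def instance_similarity(a, b, weights=(0.2, 0.6, 0.2)):
    """Weighted sum of cosines over the (BEF, BET, AFT) vectors (assumed form)."""
    return sum(w * cosine(x, y) for w, x, y in zip(weights, a, b))


def centroid(cluster):
    """Component-wise mean of each of the three context vectors in a cluster."""
    n = len(cluster)
    return tuple(
        [sum(inst[part][d] for inst in cluster) / n for d in range(len(cluster[0][part]))]
        for part in range(3)
    )


def single_pass(instances, threshold=0.7):
    """Add each instance to its most similar cluster, or open a new cluster."""
    clusters = []
    for inst in instances:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = instance_similarity(inst, centroid(cluster))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            best.append(inst)
        else:
            clusters.append([inst])  # the first instance always lands here
    return clusters
```

Each resulting cluster plays the role of one extraction pattern; its centroid can serve as the pattern's vector representation when new instances are matched later.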
The main concern in obtaining extraction patterns is that the hyponymy extraction patterns be of high quality.
Step1.7, after the extraction patterns have been obtained in Step1.6, candidate relation instances are acquired with the new-relation-instance acquisition method; the input of the algorithm is the set of candidate sentences and the set of relation patterns, and the output is the candidate relation instances.
The specific steps of Step1.7 are:
Step1.7.1, the unannotated documents are scanned to obtain all paragraphs whose semantic types match those of the relation instances in the seed set.
Step1.7.2, for each paragraph, relation instances are generated in the same way as in Step1.6.
Step1.7.3, if the similarity between a relation instance and some pattern is greater than or equal to the threshold, the relation instance is regarded as a candidate instance.
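The acceptance test of Step1.7.3 can be sketched as follows. The sketch assumes that each pattern and each new instance is a (BEF, BET, AFT) triple of vectors and that similarity is a weighted sum of cosines, as in pattern-based bootstrapping systems generally; the function names, the weights and the threshold value are illustrative assumptions, not values fixed by the patent.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def weighted_similarity(a, b, weights):
    """Weighted sum of cosines over the three context vectors."""
    return sum(w * cosine(x, y) for w, x, y in zip(weights, a, b))


def select_candidates(instances, patterns, weights=(0.2, 0.6, 0.2), threshold=0.7):
    """Return (entity_pair, best_score) for instances that match some pattern.

    instances -- list of (entity_pair, (bef_vec, bet_vec, aft_vec))
    patterns  -- list of (bef_vec, bet_vec, aft_vec) pattern vectors
    """
    accepted = []
    for pair, vectors in instances:
        score = max(
            (weighted_similarity(vectors, p, weights) for p in patterns),
            default=0.0,
        )
        if score >= threshold:  # "greater than or equal to the threshold"
            accepted.append((pair, score))
    return accepted
```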
Step2, using the candidate hyponymy relation instances and a manually constructed tourism-domain knowledge base, mapping matrices are applied to organize the candidate hyponymy instances into a hierarchy;
Step2.1, a domain knowledge base is constructed manually to serve as training data for the mapping matrices;
The specific steps of Step2.1 are:
Step2.1.1, a crawler program is written by hand to crawl tourism-domain text from tourism websites and encyclopedia entries;
Step2.1.2, the open-source toolkit Ansj is used for word segmentation, part-of-speech tagging and word-frequency counting, and the words that co-occur most often with the seeds are taken as the domain term set;
Step2.1.3, based on the classification system of Hudong Baike (Interactive Encyclopedia), a tourism-domain knowledge base containing 10000 domain entities is constructed.
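The co-occurrence counting of Step2.1.2 can be sketched with the standard library. The patent does not say how "high co-occurrence" is measured; this example counts, for each word, the number of sentences it shares with at least one seed and keeps words that reach an assumed minimum count. The function name `domain_terms` and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def domain_terms(sentences, seeds, min_count=2):
    """Words that share a sentence with a seed at least `min_count` times.

    sentences -- iterable of segmented sentences (lists of words)
    seeds     -- set of seed concept strings
    """
    counts = Counter()
    for tokens in sentences:
        if seeds.intersection(tokens):
            # Count each word once per sentence, excluding the seeds themselves.
            counts.update(t for t in set(tokens) if t not in seeds)
    return [word for word, c in counts.most_common() if c >= min_count]
```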
Step2.2, by clustering the training data and training a mapping for each cluster, it is judged whether a hyponymy relation holds between two given entities, and the entities are organized into a hierarchy.
The specific steps of Step2.2 are:
Step2.2.1, K cluster centroids are selected at random from the data set, and the hyponymy entity pairs (x, y) are clustered by their vector offsets y-x using K-means clustering;
Step2.2.2, for each cluster obtained in Step2.2.1, a mapping matrix Φ_k^* is learned that minimizes the mean squared error over the cluster:
$$\Phi_k^{*} = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y) \in C_k} \lVert \Phi_k x - y \rVert^{2}$$
where Φ_k^* denotes the learned mapping matrix and (x, y) denotes a hyponymy pair; the term ||Φ_k x - y||^2 reflects the assumption that, for an entity x and its hypernym y, there exists a matrix Φ_k such that y = Φ_k x, where x is the hyponym of y, y is the hypernym of x, and Φ_k is the transformation matrix; N_k is the number of entity pairs in the k-th cluster C_k;
Step2.2.3, after the mapping matrix of each cluster has been obtained in Step2.2.2, it is judged whether a new word pair forms a hyponymy relation;
Step2.2.4, conflicts in the hierarchy are resolved with heuristic rules: when a cycle appears in the graph, the weakest edge is removed or reversed (reversing the weakest edge yields an indirect hyponymy relation), which ensures that the final hierarchy satisfies the constraint of being a directed acyclic graph.
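The cycle handling of Step2.2.4 can be sketched as follows. The sketch covers only the "remove" option (not reversal) and assumes that each edge from hyponym to hypernym carries a confidence weight, so that the "weakest" edge is the one with the lowest weight; the patent does not define edge strength in this section, and the function names are illustrative.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle as a list, or None if acyclic."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    state = {}  # node -> 1 while on the DFS stack, 2 when finished
    path = []

    def visit(u):
        state[u] = 1
        path.append(u)
        for v in graph.get(u, []):
            if state.get(v) == 1:  # back edge: v is on the current path
                nodes = path[path.index(v):] + [v]
                return list(zip(nodes, nodes[1:]))
            if v not in state:
                found = visit(v)
                if found:
                    return found
        path.pop()
        state[u] = 2
        return None

    for node in list(graph):
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def break_cycles(weighted_edges):
    """Drop the lowest-weight edge of each cycle until the graph is acyclic.

    weighted_edges -- dict mapping (hyponym, hypernym) to a confidence weight
    """
    edges = dict(weighted_edges)  # work on a copy
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            return edges
        weakest = min(cycle, key=lambda e: edges[e])
        del edges[weakest]
```

The recursive search is adequate for a small hierarchy; a graph with very long paths would call for an iterative traversal instead.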
The main task here is to judge whether two given words stand in a hyponymy relation. After the training data has been clustered and a mapping matrix learned for each cluster, new word pairs can be judged. Given two words x and y, the cluster C_k nearest to their vector offset y-x is found first, and the corresponding mapping matrix Φ_k is obtained. For y to be a hypernym of x, two conditions must be met:
Condition one: the mapping matrix Φ_k maps x such that Φ_k x is sufficiently close to y.
Condition two: the hyponymy relation satisfies transitivity (if y is a hypernym of x and z is a hypernym of y, then z is a hypernym of x).
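Learning one mapping matrix and applying condition one can be sketched in plain Python. The sketch fits Φ for a single cluster by gradient descent on the mean of ||Φx - y||^2; the optimizer, the learning rate, the Euclidean distance and the threshold `delta` are all assumptions made for the example, since the patent only requires Φ_k x to be "sufficiently close" to y. Transitivity (condition two) is not checked here.

```python
def matvec(phi, x):
    """Multiply matrix phi (list of rows) by vector x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]


def learn_mapping(pairs, dim, lr=0.1, epochs=2000):
    """Fit phi to minimize the mean of ||phi x - y||^2 over (x, y) pairs."""
    # Start from the identity matrix.
    phi = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    n = len(pairs)
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in pairs:
            err = [p - t for p, t in zip(matvec(phi, x), y)]
            for i in range(dim):
                for j in range(dim):
                    grad[i][j] += 2.0 * err[i] * x[j] / n
        for i in range(dim):
            for j in range(dim):
                phi[i][j] -= lr * grad[i][j]
    return phi


def is_hypernym(phi, x, y, delta=0.1):
    """Condition one: phi x must lie within distance delta of y."""
    diff = [p - t for p, t in zip(matvec(phi, x), y)]
    return sum(d * d for d in diff) ** 0.5 < delta
```

For realistic embedding dimensions a closed-form least-squares solver from a numerical library would be the more natural choice; the loop above is only meant to make the objective concrete.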
The beneficial effects of the invention are:
1. Compared with existing hyponymy extraction methods, the method of the invention improves the precision of hyponymy extraction and achieves better results;
2. Compared with existing hyponymy extraction methods, the method of the invention represents words as word vectors and obtains extraction patterns by bootstrapping learning, so it extracts domain-entity hyponymy relations better;
3. The method of the invention extracts hyponymy relations effectively and provides strong support for follow-up work such as information extraction, information retrieval, machine translation and knowledge graph construction.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is a diagram of part of the semantic hierarchy of the domain knowledge base of the invention;
Fig. 3 is an example of constructing the semantic hierarchy of domain entities.
Detailed description of the embodiments
Embodiment 1: as shown in Figs. 1-3, a method for acquiring and organizing domain-entity hyponymy relations by combining word vectors and bootstrapping learning, with the following specific steps:
Step1, first, candidate hyponymy relation instances are obtained from tourism-domain text by bootstrapping learning;
Step1.1, first, a crawler program is written by hand to crawl tourism-domain text from tourism websites and encyclopedia entries;
Because web pages differ in structure, the positions and tags to be crawled also differ from site to site and no ready-made program exists, so a separate crawler is written for each crawling task. The corpus is chosen to cover as wide a range of tourism web-page topics as possible, for example Baidu Baike entries and tourism information pages.
Step1.2, the corpus is preprocessed with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal and named entity recognition;
The crawled corpus contains noise such as duplicate pages, web-page tags and invalid characters, none of which is useful. This noise is therefore removed by filtering and de-noising to obtain a high-quality text corpus that contains only tourism-domain content.
As a further scheme of the invention, the specific steps of Step1.2 are:
Step1.2.1, the crawled web-page text is filtered to remove invalid characters and invalid web pages;
Step1.2.2, the remaining valid pages are deduplicated and junk information is removed;
Step1.2.3, the Ansj segmentation tool is used to segment the corpus, tag parts of speech, remove stop words and recognize named entities.
Step1.3, because word vectors represent words as dense, low-dimensional real-valued vectors that capture morphological, syntactic and semantic information about words well, Google's open-source toolkit word2vec is chosen, and its Skip-gram model is trained on the preprocessed corpus to obtain the word vector model;
Training the word vector model is the premise and basis of hyponymy extraction and is an indispensable step. Because Chinese text is composed mainly of characters and, compared with English, the semantic relations between characters are complex, Chinese text must be segmented into words before it can be represented as word vectors. After segmentation with the segmentation tool, the result needs to be proofread manually.
Step1.4, the preprocessed documents are scanned, sentences that contain two or more domain entities are selected, and feature contexts are chosen;
As a further scheme of the invention, the specific steps of Step1.4 are:
Step1.4.1, the text is split into sentences and entities are annotated manually;
Step1.4.2, the processed documents are then scanned and sentences containing two or more domain entities are selected; the words before the first entity (BEF), the words between the two entities (BET) and the words after the second entity (AFT) are taken as the feature contexts.
Step1.5, seed set acquisition: for each context, after stop words and adjectives are removed, every remaining word is converted into its word vector, and these vectors are simply combined into one feature vector; each relation instance is then represented by the combination of the three resulting vectors;
Acquiring the seed set is likewise a premise and basis of hyponymy extraction and an indispensable step. It is the key to the bootstrapping algorithm: only a high-quality seed set makes it possible to extract high-quality hyponymy extraction patterns.
Step1.6, using the hyponymy seed set obtained in Step1.5, hyponymy extraction patterns are generated by Single-pass clustering; the input of the algorithm is the list of seed relation instances and the output is the set of relation patterns.
As a further scheme of the invention, the specific steps of Step1.6 are:
Step1.6.1, the first instance is assigned to a new, initially empty cluster (pattern);
Step1.6.2, the seed instance list is traversed and the similarity between each seed instance and every existing cluster is computed. If the similarity exceeds a given threshold, the seed instance is added to that cluster (pattern); otherwise a new cluster (pattern) is created.
Step1.6.3, to prevent erroneous patterns from being added to the pattern set, the patterns are screened by scoring.
The main concern in obtaining extraction patterns is that the hyponymy extraction patterns be of high quality.
Step1.7, after the extraction patterns have been obtained in Step1.6, candidate relation instances are acquired with the new-relation-instance acquisition method; the input of the algorithm is the set of candidate sentences and the set of relation patterns, and the output is the candidate relation instances.
As a further scheme of the invention, the specific steps of Step1.7 are:
Step1.7.1, the unannotated documents are scanned to obtain all paragraphs whose semantic types match those of the relation instances in the seed set.
Step1.7.2, for each paragraph, relation instances are generated in the same way as in Step1.6.
Step1.7.3, if the similarity between a relation instance and some pattern is greater than or equal to the threshold, the relation instance is regarded as a candidate instance.
Step2, using the candidate hyponymy relation instances and a manually constructed tourism-domain knowledge base, mapping matrices are applied to organize the candidate hyponymy instances into a hierarchy;
Step2.1, a domain knowledge base is constructed manually to serve as training data for the mapping matrices;
As a further scheme of the invention, the specific steps of Step2.1 are:
Step2.1.1, a crawler program is written by hand to crawl tourism-domain text from tourism websites and encyclopedia entries;
Step2.1.2, the open-source toolkit Ansj is used for word segmentation, part-of-speech tagging and word-frequency counting, and the words that co-occur most often with the seeds are taken as the domain term set;
Step2.1.3, based on the classification system of Hudong Baike (Interactive Encyclopedia), a tourism-domain knowledge base containing 10000 domain entities is constructed.
To learn the mapping matrices, a small-scale domain knowledge base is constructed manually as their training data. On the basis of an in-depth analysis of domain attributes and industry attributes, a domain knowledge system is defined manually, a seed set of domain-related concepts is collected, and network encyclopedia resources are used to assist in building the small-scale knowledge base, resulting in a tourism-domain knowledge base containing 10000 domain entities. Part of the semantic hierarchy of the tourism-domain knowledge base is shown in Fig. 2;
Step2.2, by clustering the training data and training a mapping for each cluster, it is judged whether a hyponymy relation holds between two given entities, and the entities are organized into a hierarchy.
By clustering the training data and training the corresponding mappings, it can be judged whether a hyponymy relation holds between two given entities. Fig. 3 shows the hierarchy constructed for domain-entity hyponymy relations using the trained mapping matrices.
As a further scheme of the invention, the specific steps of Step2.2 are:
Step2.2.1, K cluster centroids are selected at random from the data set, and the hyponymy entity pairs (x, y) are clustered by their vector offsets y-x using K-means clustering;
Step2.2.2, for each cluster obtained in Step2.2.1, a mapping matrix Φ_k^* is learned that minimizes the mean squared error over the cluster:
$$\Phi_k^{*} = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y) \in C_k} \lVert \Phi_k x - y \rVert^{2}$$
where Φ_k^* denotes the learned mapping matrix and (x, y) denotes a hyponymy pair; the term ||Φ_k x - y||^2 reflects the assumption that, for an entity x and its hypernym y, there exists a matrix Φ_k such that y = Φ_k x, where x is the hyponym of y, y is the hypernym of x, and Φ_k is the transformation matrix; N_k is the number of entity pairs in the k-th cluster C_k;
Step2.2.3, after the mapping matrix of each cluster has been obtained in Step2.2.2, it is judged whether a new word pair forms a hyponymy relation;
Step2.2.4, conflicts in the hierarchy are resolved with heuristic rules: when a cycle appears in the graph, the weakest edge is removed or reversed (reversing the weakest edge yields an indirect hyponymy relation), which ensures that the final hierarchy satisfies the constraint of being a directed acyclic graph.
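The offset clustering of Step2.2.1 can be sketched with a small K-means loop in plain Python. Initial centroids are drawn at random from the offsets, as the step describes; the fixed iteration count, the seeded random generator (used only to make runs repeatable) and the function name `kmeans_offsets` are assumptions made for the example.

```python
import random


def kmeans_offsets(pairs, k, iters=20, seed=0):
    """Cluster (x, y) pairs by the offset y - x; returns one label per pair."""
    offsets = [[b - a for a, b in zip(x, y)] for x, y in pairs]
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(offsets, k)]
    labels = [0] * len(offsets)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, off in enumerate(offsets):
            labels[i] = min(
                range(k),
                key=lambda c: sum((o - m) ** 2 for o, m in zip(off, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [offsets[i] for i in range(len(offsets)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Pairs that receive the same label form one cluster C_k, and one mapping matrix is then learned per cluster as in Step2.2.2.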
This embodiment constructs a tourism-domain knowledge base of 10000 domain entities, which provides the corpus for learning the mapping matrices of this patent;
To verify the recognition results of the invention, unified evaluation criteria are used: precision, recall and F value serve as the measures of the invention's performance.
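The three measures can be computed as in the sketch below. The patent does not spell out the formula for the F value; the sketch assumes the usual harmonic mean of precision and recall, and the function names are illustrative.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(num_correct, num_extracted, num_gold):
    """Compute (P, R, F) from counts.

    num_correct   -- extracted relations that are correct
    num_extracted -- all relations the method extracted
    num_gold      -- all correct relations in the test data
    """
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return precision, recall, f_value(precision, recall)
```

Under this assumption, `f_value(85.8, 70.2)` rounds to 77.2 and `f_value(79.4, 63.5)` rounds to 70.6, which agrees with the F values reported for Conf1 and Conf2 in Table 1.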
To verify the validity of the invention, the following groups of experiments are designed:
Experiment 1: to verify the influence of the three kinds of features on domain-entity hyponymy extraction performance, two different sets of weight parameters are chosen, shown as Conf1 and Conf2:
Conf1: α=0.1, β=0.8, γ=0.1
Conf2: α=0.2, β=0.6, γ=0.2
Conf1 is dominated by the context between the two entities (BET), while Conf2 draws more evenly on the context information of all three kinds of features. The precision, recall and F value reported here are averages over the TOP5 patterns. The experimental results are shown in Table 1.
Table 1: Influence of different features on domain-entity hyponymy extraction performance
Parameter | P (%) | R (%) | F (%) |
Conf1 | 85.8 | 70.2 | 77.2 |
Conf2 | 79.4 | 63.5 | 70.6 |
As the data in Table 1 show, for most types of hyponymy pattern the recall obtained with the Conf2 settings is lower than with Conf1. Analysis of the results shows that the main reason is that the BEF and AFT contexts are too sparse and contain many words that contribute nothing to the relation between the entity pair. The results indicate that the words between the entity pair matter more for recognizing its hyponymy relation.
Experiment 2: to verify the feasibility of the proposed method, a comparison is carried out on the same experimental data set, using the TOP5 pattern clustering results. The experimental results are shown in Table 2.
Table 2: Comparison of different hyponymy extraction methods
Table 2 shows that, on the TOP5 pattern clustering results, the proposed method achieves better F values than the Snowball algorithm throughout, and for some relation patterns the F value is as much as 20% higher than Snowball's. The results indicate that the word vector model captures semantic features of the language and improves the extraction of entity hyponymy relations.
Experiment 3: to verify the influence of the domain knowledge base on the construction of the hierarchy, experiments are run with and without the domain knowledge base. The experimental results are shown in Table 3.
Table 3: Influence of the domain knowledge base on the organization of domain-entity hyponymy relations
Method | P (%) | R (%) | F (%) |
Word vectors | 75.3 | 67.5 | 69.3 |
Word vectors + knowledge base | 78.3 | 79.8 | 79.0 |
As Table 3 shows, adding the domain knowledge base as a constraint greatly improves the recall of the method, which demonstrates the important role of the domain knowledge base in organizing the hyponymy relations of domain entities.
Experiment 4: to verify the feasibility of the proposed method, it is compared with a rule-based method and a CRF-based method. The experimental results are shown in Table 4.
Table 4: Domain-entity hyponymy recognition results
Method | P (%) | R (%) | F (%) |
Rule-based method | 84.4 | 48.9 | 61.9 |
CRF-based method | 75.1 | 72.4 | 73.7 |
Proposed method | 78.2 | 79.8 | 79.0 |
As Table 4 shows, compared with the rule-based method, the proposed word-vector-based method is somewhat lower in precision but far higher in recall. Compared with the method based on cascaded conditional random fields, the proposed method improves both precision and recall. The results demonstrate the feasibility of the proposed method for automatically constructing domain-entity hierarchies.
The embodiments of the invention have been described in detail above with reference to the drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the invention.
Claims (7)
1. A method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning, characterized in that the method comprises the following specific steps:
Step1, first, candidate hypernym-hyponym relation instances are obtained from tourism-domain text by bootstrapping learning;
Step1.1, a crawler program is first written manually, and tourism-domain text is crawled from tourism websites and encyclopedia entries;
Step1.2, the corpus is preprocessed with the open-source toolkit Ansj, including word segmentation, part-of-speech tagging, stop-word removal, and named entity recognition;
Step1.3, Google's open-source toolkit word2vec is selected, and a word vector model is trained on the preprocessed corpus using the Skip-gram model;
Step1.4, the preprocessed documents are scanned, sentences that contain two or more domain entities are selected, and feature contexts are chosen;
Step1.5, seed set acquisition: after stop words and adjectives are removed from each context, every remaining word is converted into a single word vector, these vectors are simply combined to obtain a feature vector, and a combination of three vectors is then used to represent any relation instance;
Step1.6, the hypernym-hyponym seed set is obtained from Step1.5, and hypernym-hyponym relation extraction patterns are generated by Single-pass clustering;
Step1.7, after the extraction patterns are obtained in Step1.6, candidate relation instances are acquired using the new-relation-instance acquisition method;
Step2, using the candidate hypernym-hyponym relation instances, a manually constructed tourism-domain knowledge base, and mapping matrices, the candidate hypernym-hyponym relation instances are organized into a hierarchy;
Step2.1, a domain knowledge base is constructed manually as the training data for the mapping matrices;
Step2.2, by clustering the training data and training the corresponding mappings, it is judged whether a hypernym-hyponym relation exists between two given entities, and the hierarchical relations are organized.
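As an illustration of Step1.5, the following is a minimal sketch of how one relation instance could be turned into three context vectors. The claim only says the word vectors are "simply combined"; averaging is assumed here, and the function names `combine` and `instance_vectors`, the `drop` set (standing in for the removed stop words and adjectives), and the toy embeddings are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Vector = List[float]


def combine(words: Sequence[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of the words that have an embedding (assumed combination rule)."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return [0.0] * dim  # no usable word: fall back to the zero vector
    return [sum(col) / len(known) for col in zip(*known)]


def instance_vectors(
    bef: Sequence[str],
    bet: Sequence[str],
    aft: Sequence[str],
    embeddings: Dict[str, Vector],
    dim: int,
    drop: FrozenSet[str] = frozenset(),
) -> Tuple[Vector, Vector, Vector]:
    """Represent one relation instance as three context vectors (before, between, after)."""
    def keep(words: Sequence[str]) -> List[str]:
        # `drop` stands in for the stop words and adjectives removed in Step1.5
        return [w for w in words if w not in drop]

    b, m, a = (combine(keep(ws), embeddings, dim) for ws in (bef, bet, aft))
    return b, m, a


# Toy 2-dimensional embeddings, for illustration only
toy = {"is": [1.0, 0.0], "a": [0.0, 1.0]}
print(instance_vectors(["the"], ["is", "a", "famous"], [], toy, 2, drop=frozenset({"famous"})))
```

In practice the embeddings would come from the Skip-gram model trained in Step1.3; the dictionary above only keeps the sketch self-contained.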
2. The method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.2 are:
Step1.2.1, the crawled web page text is filtered to remove invalid characters and invalid web pages;
Step1.2.2, the resulting valid web pages are deduplicated and junk information is removed;
Step1.2.3, the Ansj word segmentation tool is used to perform word segmentation, part-of-speech tagging, stop-word removal, and named entity recognition.
3. The method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.4 are:
Step1.4.1, the text is split into sentences, and entities are annotated manually;
Step1.4.2, finally, the processed documents are scanned, sentences that contain two or more domain entities are selected, and the words before the first entity (BEF), the words between the two entities (BET), and the words after the second entity (AFT) are chosen as the feature contexts.
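The BEF/BET/AFT split of Step1.4.2 can be sketched as follows. This is only an illustration under stated assumptions: the sentence is already tokenized, each entity is given as a token span, and every word on each side is kept because the claim does not specify a context window. The function name `split_contexts` and the example sentence are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def split_contexts(
    tokens: List[str], first: Span, second: Span
) -> Tuple[List[str], List[str], List[str]]:
    """Return the words before the first entity, between the entities, and after the second."""
    if not (0 <= first[0] < first[1] <= second[0] < second[1] <= len(tokens)):
        raise ValueError("entity spans must be ordered, non-empty and non-overlapping")
    bef = tokens[: first[0]]
    bet = tokens[first[1] : second[0]]
    aft = tokens[second[1] :]
    return bef, bet, aft


# Hypothetical tokenized sentence with two annotated entities
sentence = ["the", "Stone Forest", "is", "a", "famous", "scenic spot", "in", "Yunnan"]
bef, bet, aft = split_contexts(sentence, first=(1, 2), second=(5, 6))
print(bef, bet, aft)
```

A sentence with more than two entities would yield one such triple per entity pair.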
4. The method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.6 are:
Step1.6.1, the first instance is assigned to a first new, empty cluster pattern;
Step1.6.2, the seed instance list is traversed, and the similarity between each seed instance and each cluster is computed; if the similarity is greater than a threshold, the seed instance is added to that cluster pattern, otherwise a new cluster pattern is created;
Step1.6.3, to prevent erroneous patterns from being added to the pattern set, the patterns are screened by scoring.
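Steps Step1.6.1 and Step1.6.2 describe a single-pass clustering loop, which might look roughly like the sketch below. Several details are assumptions rather than part of the claim: each instance is reduced to a single vector, similarity is cosine similarity against the cluster centroid, and the scoring-based screening of Step1.6.3 is left out. The names `cosine`, `centroid` and `single_pass` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(members: List[Vector]) -> Vector:
    """Component-wise mean of the cluster members."""
    return [sum(col) / len(members) for col in zip(*members)]


def single_pass(instances: List[Vector], threshold: float) -> List[List[Vector]]:
    """Assign each instance to its most similar cluster, or open a new cluster."""
    clusters: List[List[Vector]] = []
    for vec in instances:
        best_idx, best_sim = -1, -1.0
        for i, members in enumerate(clusters):
            sim = cosine(vec, centroid(members))
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if clusters and best_sim > threshold:
            clusters[best_idx].append(vec)  # similar enough: join the existing pattern
        else:
            clusters.append([vec])  # first instance, or nothing similar: new pattern
    return clusters


seeds = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print([len(c) for c in single_pass(seeds, threshold=0.8)])
```

Because instances are processed once and in order, the resulting patterns depend on the order of the seed list, which is a known property of single-pass clustering.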
5. The method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step1.7 are:
Step1.7.1, the unannotated documents are scanned to obtain all passages whose semantic types are the same as those of the relation instances in the seed set;
Step1.7.2, for each passage, a relation instance is generated in the same way as in Step1.6;
Step1.7.3, if the similarity between a relation instance and some pattern is greater than or equal to a threshold, the relation instance is taken to be a candidate instance.
6. The method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step2.1 are:
Step2.1.1, a crawler program is written manually, and tourism-domain text is crawled from tourism websites and encyclopedia entries;
Step2.1.2, the open-source toolkit Ansj is used for word segmentation, part-of-speech tagging, and word frequency counting, and the words that co-occur most often with the seeds are taken as the domain term set;
Step2.1.3, based on the classification system of Hudong Baike (Interactive Encyclopedia), a tourism-domain knowledge base containing 10000 domain entities is constructed.
7. The method for acquiring and organizing hypernym-hyponym relations of domain entities by combining word vectors and bootstrapping learning according to claim 1, characterized in that the specific steps of Step2.2 are:
Step2.2.1, K cluster centroids are randomly selected from the data set, and the hypernym-hyponym entity pairs (x, y) are clustered by their vector offsets y-x using K-means clustering;
Step2.2.2, for each cluster obtained in Step2.2.1, a mapping is learned separately by minimizing the following objective:
$$\Phi_k^{*} = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y)\in C_k} \lVert \Phi_k x - y \rVert^{2}$$
where $\Phi_k^{*}$ denotes the mapping matrix and (x, y) denotes a hypernym-hyponym pair; the term $\lVert \Phi_k x - y \rVert^{2}$ expresses that, for a given entity x and its hypernym y, there exists a matrix $\Phi_k$ such that $y = \Phi_k x$, where x is the hyponym of y, y is the hypernym of x, and $\Phi_k$ denotes the transition matrix; $N_k$ is the number of entity pairs in the k-th cluster $C_k$;
Step2.2.3, after the mapping matrix of each class is obtained in Step2.2.2, it is judged whether a new word pair forms a hypernym-hyponym relation;
Step2.2.4, heuristic rules are used to handle conflicts in the hierarchy: when a cycle appears in the graph, the weakest edge is removed or reversed, and reversing the weakest edge forms an indirect hypernym-hyponym relation.
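The conflict handling of Step2.2.4 can be illustrated with a small sketch that repeatedly finds a cycle in the directed hypernym graph and drops its weakest edge. This covers only the "remove" option, not the "reverse" option. The claim does not define how edge strength is measured, so the numeric weights below are assumed confidence scores, and the names `find_cycle` and `break_cycles` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (hyponym, hypernym)


def find_cycle(edges: Dict[Edge, float]) -> Optional[List[Edge]]:
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state: Dict[str, int] = {}  # 1 = on the current DFS path, 2 = fully explored
    path: List[str] = []

    def visit(node: str) -> Optional[List[Edge]]:
        state[node] = 1
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == 1:  # back edge closes a cycle
                loop = path[path.index(nxt):] + [nxt]
                return list(zip(loop, loop[1:]))
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def break_cycles(edges: Dict[Edge, float]) -> Dict[Edge, float]:
    """Remove the weakest edge of each cycle until the graph is acyclic."""
    remaining = dict(edges)
    while True:
        cycle = find_cycle(remaining)
        if cycle is None:
            return remaining
        weakest = min(cycle, key=lambda e: remaining[e])
        del remaining[weakest]


# Hypothetical candidate relations with assumed confidence scores
candidates = {
    ("lake", "scenic spot"): 0.9,
    ("scenic spot", "place"): 0.8,
    ("place", "lake"): 0.2,
}
print(sorted(break_cycles(candidates)))
```

Removing one edge per detected cycle keeps as many of the stronger relations as possible; a variant that reverses the weakest edge instead would need to recheck for cycles in the same loop.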
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710484051.XA CN107463607B (en) | 2017-06-23 | 2017-06-23 | Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107463607A (en) | 2017-12-12 |
CN107463607B (en) | 2020-07-31 |
Family
ID=60546337
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201710484051.XA, CN107463607B (en) | Active | 2017-06-23 | 2017-06-23 | Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107463607B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015072899A1 (en) * | 2013-11-15 | 2015-05-21 | Telefonaktiebolaget L M Ericsson (Publ) | Methods and devices for bootstrapping of resource constrained devices |
CN106095736A (en) * | 2016-06-07 | 2016-11-09 | 华东师范大学 | A kind of method of field neologisms extraction |
CN106844413A (en) * | 2016-11-11 | 2017-06-13 | 南京缘长信息科技有限公司 | The method and device of entity relation extraction |
Non-Patent Citations (1)
Title |
---|
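Wang Pei et al.: "A domain-specific entity disambiguation method combining word vectors and graph models", CAAI Transactions on Intelligent Systems (《智能系统学报》) * |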
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059310A (en) * | 2018-01-19 | 2019-07-26 | 腾讯科技(深圳)有限公司 | Extending method and device, electronic equipment, the storage medium of hypernym network |
CN110059310B (en) * | 2018-01-19 | 2022-10-28 | 腾讯科技(深圳)有限公司 | Hypernym network expansion method and device, electronic equipment and storage medium |
CN108280221B (en) * | 2018-02-08 | 2022-04-15 | 北京百度网讯科技有限公司 | Method and device for hierarchically constructing focus points and computer equipment |
CN108280221A (en) * | 2018-02-08 | 2018-07-13 | 北京百度网讯科技有限公司 | Stratification construction method, device and the computer equipment of focus |
CN108763192A (en) * | 2018-04-18 | 2018-11-06 | 达而观信息科技(上海)有限公司 | Entity relation extraction method and device for text-processing |
CN108763192B (en) * | 2018-04-18 | 2022-04-19 | 达而观信息科技(上海)有限公司 | Entity relation extraction method and device for text processing |
CN108874878A (en) * | 2018-05-03 | 2018-11-23 | 众安信息技术服务有限公司 | A kind of building system and method for knowledge mapping |
CN108897857B (en) * | 2018-06-28 | 2021-08-27 | 东华大学 | Chinese text subject sentence generating method facing field |
CN108897857A (en) * | 2018-06-28 | 2018-11-27 | 东华大学 | The Chinese Text Topic sentence generating method of domain-oriented |
CN109086328A (en) * | 2018-06-29 | 2018-12-25 | 北京百度网讯科技有限公司 | A kind of determination method, apparatus, server and the storage medium of hyponymy |
CN108959258B (en) * | 2018-07-02 | 2021-06-18 | 昆明理工大学 | Specific field integrated entity linking method based on representation learning |
CN108959258A (en) * | 2018-07-02 | 2018-12-07 | 昆明理工大学 | It is a kind of that entity link method is integrated based on the specific area for indicating to learn |
CN110209832A (en) * | 2018-08-08 | 2019-09-06 | 腾讯科技(北京)有限公司 | Method of discrimination, system and the computer equipment of hyponymy |
CN109408642B (en) * | 2018-08-30 | 2021-07-16 | 昆明理工大学 | Domain entity attribute relation extraction method based on distance supervision |
CN109408642A (en) * | 2018-08-30 | 2019-03-01 | 昆明理工大学 | A kind of domain entities relation on attributes abstracting method based on distance supervision |
CN109522547A (en) * | 2018-10-23 | 2019-03-26 | 浙江大学 | Chinese synonym iteration abstracting method based on pattern learning |
CN109492098B (en) * | 2018-10-24 | 2022-05-06 | 北京工业大学 | Target language material library construction method based on active learning and semantic density |
CN109492098A (en) * | 2018-10-24 | 2019-03-19 | 北京工业大学 | Target corpus base construction method based on Active Learning and semantic density |
CN109446530A (en) * | 2018-11-03 | 2019-03-08 | 上海犀语科技有限公司 | It is a kind of based on LSTM model by the method and device of Extracting Information in text |
CN109522418A (en) * | 2018-11-08 | 2019-03-26 | 杭州费尔斯通科技有限公司 | A kind of automanual knowledge mapping construction method |
CN109740149A (en) * | 2018-12-11 | 2019-05-10 | 英大传媒投资集团有限公司 | A kind of synonym extracting method based on remote supervisory |
CN109740149B (en) * | 2018-12-11 | 2019-12-13 | 英大传媒投资集团有限公司 | remote supervision-based synonym extraction method |
CN112528045A (en) * | 2020-12-23 | 2021-03-19 | 中译语通科技股份有限公司 | Method and system for judging domain map relation based on open encyclopedia map |
CN112528045B (en) * | 2020-12-23 | 2024-04-02 | 中译语通科技股份有限公司 | Method and system for judging domain map relation based on open encyclopedia map |
CN114003734A (en) * | 2021-11-22 | 2022-02-01 | 四川大学华西医院 | Breast cancer risk factor knowledge system model, knowledge map system and construction method |
CN114003734B (en) * | 2021-11-22 | 2023-06-30 | 四川大学华西医院 | Knowledge system and knowledge map system of breast cancer risk factors and construction method |
Also Published As
Publication number | Publication date |
---|---|
CN107463607B (en) | 2020-07-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107463607A (en) | Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning | |
CN106777274B (en) | A kind of Chinese tour field knowledge mapping construction method and system | |
CN110442760B (en) | Synonym mining method and device for question-answer retrieval system | |
CN107766324B (en) | Text consistency analysis method based on deep neural network | |
CN109829159B (en) | Integrated automatic lexical analysis method and system for ancient Chinese text | |
CN112214610B (en) | Entity relationship joint extraction method based on span and knowledge enhancement | |
CN107861939A (en) | A kind of domain entities disambiguation method for merging term vector and topic model | |
Demir et al. | Improving named entity recognition for morphologically rich languages using word embeddings | |
CN107122340B (en) | A kind of similarity detection method of the science and technology item return based on synonym analysis | |
CN109271506A (en) | A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
CN109960800A (en) | Weakly supervised file classification method and device based on Active Learning | |
CN107992633A (en) | Electronic document automatic classification method and system based on keyword feature | |
CN107818164A (en) | A kind of intelligent answer method and its system | |
CN106844658A (en) | A kind of Chinese text knowledge mapping method for auto constructing and system | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN103646112B (en) | Dependency parsing field self-adaption method based on web search | |
CN107122349A (en) | A kind of feature word of text extracting method based on word2vec LDA models | |
CN110516074B (en) | Website theme classification method and device based on deep learning | |
CN106599054A (en) | Method and system for title classification and push | |
CN108509409A (en) | A method of automatically generating semantic similarity sentence sample | |
CN102750316A (en) | Concept relation label drawing method based on semantic co-occurrence model | |
CN105404674B (en) | Knowledge-dependent webpage information extraction method | |
CN106096005A (en) | A kind of rubbish mail filtering method based on degree of depth study and system | |
CN110472203B (en) | Article duplicate checking and detecting method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | Inventors after: Yu Zhengtao, Ma Xiaojun, Guo Jianyi, Chen Wei, Zhang Zhikun. Inventors before: Guo Jianyi, Ma Xiaojun, Yu Zhengtao, Chen Wei, Zhang Zhikun. |
GR01 | Patent grant | ||