CN106569993A - Method and device for mining hypernym-hyponym relation between domain-specific terms - Google Patents
- Publication number: CN106569993A (application CN201510652163.2A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Machine Translation (AREA)
Abstract
The invention provides a method and device for mining the hypernym-hyponym relation between domain-specific terms. The method includes: acquiring, according to a plurality of first predetermined domain-specific terms, the entry explanation sentences on dictionary pages in which first domain-specific terms are located, the first domain-specific terms being words semantically related to the first predetermined domain-specific terms; and acquiring, by using a hypernym-hyponym relation model file generated in advance with a conditional random field (CRF) tool, the hypernym-hyponym relation between the first domain-specific terms and the words included in the entry explanation sentences. According to the scheme of the invention, the entry explanation sentences in which the first domain-specific terms are located on dictionary pages are used, training and learning are performed with CRF machine-learning techniques, a model file is finally established, and the hypernym-hyponym relation between the first domain-specific terms and the words included in the entry explanation sentences is acquired by using the model file, so that the accuracy of hypernym-hyponym relation acquisition can be improved.
Description
Technical field
The present invention relates to the technical field of data services, and in particular to a method and device for mining the hypernym-hyponym relation between domain-specific terms.
Background art
The hypernym-hyponym relation is a kind of semantic relation, usually used in the construction and refinement of dictionaries, ontologies and knowledge bases. In ontology learning the hypernym-hyponym relation is defined as follows: given two terms D and U, if the meaning expressed by U includes that of D, then U and D are considered to have a hypernym-hyponym relation, where U is the superordinate concept (hypernym) of D and D is the subordinate concept (hyponym) of U, denoted ISA(D, U). Examples are ISA(carbon dioxide, greenhouse gas) and ISA(4G business-travel package, tariff package). The hypernym-hyponym relation can be applied to query expansion in search engines or automatic question answering. For example, when a user searches for "4G business-travel package", the superordinate concept "tariff package" of the searched entity can be used to push more, broader related information to the user, enriching the user's search experience and strengthening user retention.
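For illustration only (this sketch is not part of the original disclosure; the relation store and the expansion logic are hypothetical), hypernym-based query expansion of the kind described above could look like:

```python
# Mined ISA(D, U) pairs stored as hyponym -> hypernym (hypothetical data).
hypernyms = {
    "carbon dioxide": "greenhouse gas",
    "4G business-travel package": "tariff package",
}

def expand_query(query):
    """Return the original query plus its hypernym, if one was mined."""
    expanded = [query]
    if query in hypernyms:
        expanded.append(hypernyms[query])
    return expanded

# The expanded query also retrieves pages about the broader concept.
print(expand_query("4G business-travel package"))
# ['4G business-travel package', 'tariff package']
```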
At present there has been considerable research abroad on acquiring hypernym-hyponym relations between concepts; the common approaches are template-based methods, dictionary-based methods, statistics-based methods, and hybrids of these. Research specifically targeting the acquisition of hypernym-hyponym relations between concepts in Chinese text, however, is still scarce, and mainly uses the template-based and dictionary-based methods.
The basic idea of the template-based method is to use text-analysis techniques to summarize, from domain text, some frequently occurring language patterns as rules, and then use pattern matching to judge whether a word sequence in the text matches a certain pattern, thereby identifying the corresponding hypernym-hyponym relation. This method can extract knowledge effectively with little supervision from professionals and little predefined knowledge, while also avoiding the use of complex natural-language-understanding models to process the relations between concepts.
The dictionary-based method typically obtains the relations between ontology concepts from knowledge such as the synonyms, near-synonyms and antonyms defined in existing manually constructed lexicons. For English, for example, WordNet, an English dictionary based on cognitive linguistics, is used to obtain classification relations between concepts. For Chinese, for example, Dong Zhendong compiled a general-domain dictionary, HowNet, which describes the relations between words with a sememe tree: for each disambiguated term, taking the semantics determined after disambiguation as the center and the exploration depth specified by the user as the radius, a circle is drawn, and the concepts contained in the concept sets within the circle are found. Finally these relations are added to the relation set of the ontology.
However, most existing methods for mining hypernym-hyponym relations between ontology concepts simply rely on the prior knowledge of domain experts and on rule matching, and therefore have the following problems:
First, the template-based method: because of the complexity of the Chinese language and the lack of semantic analysis during pattern matching, a large number of useless concepts are also matched, which greatly reduces the accuracy of the extraction results; furthermore, because Chinese syntactic forms are highly diverse, it is difficult to construct a relatively complete pattern set.
Second, the dictionary-based method: the cost of constructing, maintaining and updating a dictionary is prohibitive, and many specialized domain terms are hard to find in a general-domain lexicon; moreover, when the method is applied to different domains, the explanation of a specific word does not necessarily meet the requirements of the specific domain, a difficulty faced by all dictionary-based methods.
Therefore, the hypernym-hyponym relation acquisition methods currently in use all suffer from relatively low precision and recall; that is, the particularity and complexity of real language text make the mining of hypernym-hyponym relations very challenging.
Content of the invention
To overcome the above problems in the prior art, the embodiments of the present invention provide a method and device for mining the hypernym-hyponym relation between domain-specific terms, which combine the entry explanation sentences on dictionary pages with a conditional random field (CRF) tool, so that the hypernym-hyponym relations acquired between domain terms are more accurate.
To solve the above technical problem, the present invention adopts the following technical solutions:
According to one aspect of the embodiments of the present invention, a method for mining the hypernym-hyponym relation between domain-specific terms is provided, including:
acquiring, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located, the first domain terms being words semantically related to the first predetermined domain terms; and
acquiring, by using a hypernym-hyponym relation model file generated in advance with a CRF tool, the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences.
In the above scheme, before the step of acquiring, according to the plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which the first domain terms are located, the method further includes:
generating the hypernym-hyponym relation model file by using the CRF tool.
In the above scheme, the step of generating the hypernym-hyponym relation model file by using the CRF tool includes:
acquiring, according to a plurality of second predetermined domain terms, the entry explanation sentences on dictionary pages in which second domain terms are located, and saving them as training sentences, the second domain terms being words semantically related to the second predetermined domain terms;
generating a training file from the training sentences according to a preset CRF training file format; and
generating the hypernym-hyponym relation model file with the CRF tool according to the training file and a predetermined template file.
In the above scheme, the step of generating the training file from the training sentences according to the preset CRF training file format includes:
performing word segmentation on the training sentences and carrying out part-of-speech tagging to obtain first tagged sentences; and
generating a training file to be tagged from the first tagged sentences according to the preset CRF training file format, tagging the features in the file to be tagged other than the word feature and the part-of-speech feature, and thereby generating the training file.
In the above scheme, the step of generating the training file to be tagged from the first tagged sentences according to the preset CRF training file format, tagging the features in the file to be tagged other than the word feature and the part-of-speech feature, and generating the training file includes:
generating the training file to be tagged from the first tagged sentences according to the preset CRF training file format, wherein the training file to be tagged includes a plurality of tokens located on different rows, each token including a plurality of feature columns and a result output token column, the plurality of feature columns respectively representing different features, wherein the features include a word feature, a part-of-speech feature, a feature-word dictionary feature and a colon-information feature, the word feature column corresponding to one word or one punctuation mark included in the first tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
extracting feature words from the first tagged sentences and storing them in a feature-word dictionary;
judging, according to the feature-word dictionary, whether the content of the word feature column of each token in the training file to be tagged is a feature word, and if so, marking 1 in the feature-word dictionary feature column, otherwise marking 0;
judging whether the content of the word feature column of each token in the training file to be tagged is a colon, and if so, marking 1 in the colon-information feature column, otherwise marking 0; and
judging whether the contents of the word feature columns of a plurality of tokens in the training file to be tagged are words having a hypernym-hyponym relation, and if so, marking U in the result output token column corresponding to the word that is the superordinate concept and marking D in the result output token column corresponding to the word that is the subordinate concept; if not, marking 0 in the result output token column of the token.
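For illustration only (not part of the original disclosure; the sample tokens and the stand-in feature-word dictionary are made up), the five-column token layout described above, with U/D/0 output labels, can be sketched as:

```python
FEATURE_WORDS = {"refer", "kind"}    # stand-in feature-word dictionary

def make_token_rows(tagged_words, labels):
    """Build tab-separated five-column rows for a CRF training file:
    word, part of speech, feature-word flag, colon flag, output label."""
    rows = []
    for (word, pos), label in zip(tagged_words, labels):
        dict_flag = "1" if word in FEATURE_WORDS else "0"   # feature-word dictionary feature
        colon_flag = "1" if word == ":" else "0"            # colon-information feature
        rows.append("\t".join([word, pos, dict_flag, colon_flag, label]))
    return rows

tagged = [("El Nino", "n"), ("refer", "v"), ("extreme climate phenomenon", "n")]
labels = ["D", "0", "U"]   # hyponym, neither, hypernym
for row in make_token_rows(tagged, labels):
    print(row)
```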
In the above scheme, the step of extracting feature words from the first tagged sentences and storing them in the feature-word dictionary includes:
filtering out the start and stop tag words, nouns, adjectives, English words, time words, Arabic numerals and punctuation marks in the first tagged sentences, and taking the remaining words in the first tagged sentences as candidate feature words;
counting the term frequency of each candidate feature word, the term frequency being the number of times the candidate feature word occurs in the first tagged sentences; and
screening out feature words, according to the semantic function of the candidate feature words, from the candidate feature words whose term frequency exceeds a predetermined threshold, and storing them in the feature-word dictionary.
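A minimal sketch of the candidate-feature-word step (illustrative, not from the patent; the filtered tag names are assumptions loosely based on the tag set explained later in the document, and the final screening by semantic function is a manual step not shown):

```python
from collections import Counter

# Tags assumed filtered out: nouns (n), adjectives (a), punctuation (w),
# numerals (m), English (eng), time words (t). Illustrative only.
FILTERED_POS = {"n", "a", "w", "m", "eng", "t"}

def candidate_feature_words(tagged_sentences, threshold):
    """Count non-filtered words; keep those whose frequency exceeds threshold."""
    counts = Counter(
        word
        for sentence in tagged_sentences
        for word, pos in sentence
        if pos not in FILTERED_POS
    )
    return {word for word, count in counts.items() if count > threshold}

sentences = [[("refers", "v"), ("to", "p"), ("energy", "n")],
             [("refers", "v"), ("is", "v")]]
print(candidate_feature_words(sentences, 1))   # {'refers'}
```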
In the above scheme, the step of generating the hypernym-hyponym relation model file with the CRF tool according to the training file and the template file is:
inputting, according to a plurality of feature templates included in the template file, the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column included in each token of the training file into the CRF tool, and outputting the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance being the number of unit rows to slide forward or backward in the training file, taking the current feature as the reference, and the column position information indicating one of the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column.
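The "sliding distance plus column position" mechanism described here corresponds to the template syntax of common CRF toolkits such as CRF++, where `%x[row,col]` selects the feature `row` rows before or after the current token in column `col`. A hypothetical template fragment for the five-column training file might read as follows (the template names and the window size are assumptions, not taken from the patent):

```
# word feature (column 0) of the previous, current and next token
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
# part-of-speech feature (column 1) of the current token
U03:%x[0,1]
# feature-word dictionary flag (column 2) and colon flag (column 3)
U04:%x[0,2]
U05:%x[0,3]
# bigram over adjacent output labels
B
```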
In the above scheme, the step of acquiring, according to the plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which the first domain terms are located includes:
crawling, with a web crawler, the dictionary pages related to the first predetermined domain terms, and acquiring the entry explanation sentences in which the first predetermined domain terms are located on the dictionary pages;
acquiring, from the entry explanation sentences in which the first predetermined domain terms are located, the domain terms that carry hyperlinks; and
crawling, with the web crawler, the dictionary pages related to the domain terms that carry hyperlinks, and acquiring on those dictionary pages the entry explanation sentences in which the words semantically related to the domain terms that carry hyperlinks are located, until the preset crawl depth of the web crawler is reached.
In the above scheme, the step of acquiring, by using the hypernym-hyponym relation model file generated in advance with the CRF tool, the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences includes:
performing word segmentation on the entry explanation sentences and carrying out part-of-speech tagging to obtain second tagged sentences;
generating a test file from the second tagged sentences according to a preset CRF test file format, the test file including a plurality of tokens, each token including two columns that respectively represent the word feature and the part-of-speech feature, the word feature column corresponding to one word or one punctuation mark included in the second tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column; and
testing the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
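The two-column test file can be sketched like this (illustrative only; the sample sentence is invented, and the separator and blank-line conventions follow the training-file format described in the embodiment below):

```python
def make_test_file(tagged_words):
    """Write word and part-of-speech columns; the model predicts the labels."""
    lines = ["\t".join((word, pos)) for word, pos in tagged_words]
    return "\n".join(lines) + "\n\n"   # blank line ends the sentence

tagged = [("El Nino", "n"), ("is", "v"), ("extreme climate phenomenon", "n")]
print(make_test_file(tagged), end="")
```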
According to another aspect of the embodiments of the present invention, a device for mining the hypernym-hyponym relation between domain-specific terms is further provided, including:
an acquisition module, configured to acquire, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located, the first domain terms being words semantically related to the first predetermined domain terms; and
a relation acquisition module, configured to acquire, by using the hypernym-hyponym relation model file generated in advance with the CRF tool, the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences.
In the above scheme, the device further includes:
a model building module, configured to generate the hypernym-hyponym relation model file by using the CRF tool.
In the above scheme, the model building module includes:
an acquisition unit, configured to acquire, according to a plurality of second predetermined domain terms, the entry explanation sentences on dictionary pages in which second domain terms are located, and to save them as training sentences, the second domain terms being words semantically related to the second predetermined domain terms;
a file generating unit, configured to generate a training file from the training sentences according to a preset CRF training file format; and
a relation acquiring unit, configured to generate the hypernym-hyponym relation model file with the CRF tool according to the training file and a predetermined template file.
In the above scheme, the file generating unit includes:
a first tagging subunit, configured to perform word segmentation on the training sentences and carry out part-of-speech tagging to obtain first tagged sentences; and
a second tagging subunit, configured to generate a training file to be tagged from the first tagged sentences according to the preset CRF training file format, to tag the features in the file to be tagged other than the word feature and the part-of-speech feature, and to generate the training file.
In the above scheme, the second tagging subunit is specifically configured to:
generate the training file to be tagged from the first tagged sentences according to the preset CRF training file format, wherein the training file to be tagged includes a plurality of tokens located on different rows, each token including a plurality of feature columns and a result output token column, the plurality of feature columns respectively representing different features, wherein the features include a word feature, a part-of-speech feature, a feature-word dictionary feature and a colon-information feature, the word feature column corresponding to one word or one punctuation mark included in the first tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
extract feature words from the first tagged sentences and store them in a feature-word dictionary;
judge, according to the feature-word dictionary, whether the content of the word feature column of each token in the training file to be tagged is a feature word, and if so, mark 1 in the feature-word dictionary feature column, otherwise mark 0;
judge whether the content of the word feature column of each token in the training file to be tagged is a colon, and if so, mark 1 in the colon-information feature column, otherwise mark 0; and
judge whether the contents of the word feature columns of a plurality of tokens in the training file to be tagged are words having a hypernym-hyponym relation, and if so, mark U in the result output token column corresponding to the word that is the superordinate concept and mark D in the result output token column corresponding to the word that is the subordinate concept; if not, mark 0 in the result output token column of the token.
In the above scheme, the feature words are obtained by filtering out the start and stop tag words, nouns, adjectives, English words, time words, Arabic numerals and punctuation marks in the first tagged sentences, taking the remaining words in the first tagged sentences as candidate feature words, then counting the term frequency of each candidate feature word, and screening the feature words, according to the semantic function of the candidate feature words, from the candidate feature words whose term frequency exceeds a predetermined threshold, wherein the term frequency is the number of times the candidate feature word occurs in the first tagged sentences.
In the above scheme, the relation acquiring unit is specifically configured to:
input, according to a plurality of feature templates included in the template file, the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column included in each token of the training file into the CRF tool, and output the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance being the number of unit rows to slide forward or backward in the training file, taking the current feature as the reference, and the column position information indicating one of the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column.
In the above scheme, the acquisition module includes:
a first crawling unit, configured to crawl, with a web crawler, the dictionary pages related to the first predetermined domain terms, and to acquire the entry explanation sentences in which the first predetermined domain terms are located on the dictionary pages;
a first acquiring unit, configured to acquire, from the entry explanation sentences in which the first predetermined domain terms are located, the domain terms that carry hyperlinks; and
a second crawling unit, configured to crawl, with the web crawler, the dictionary pages related to the domain terms that carry hyperlinks, and to acquire on those dictionary pages the entry explanation sentences in which the words semantically related to the domain terms that carry hyperlinks are located, until the preset crawl depth of the web crawler is reached.
In the above scheme, the relation acquisition module includes:
a word segmentation unit, configured to perform word segmentation on the entry explanation sentences and carry out part-of-speech tagging to obtain second tagged sentences;
a format conversion unit, configured to generate a test file from the second tagged sentences according to a preset CRF test file format, the test file including a plurality of tokens, each token including two columns that respectively represent the word feature and the part-of-speech feature, the word feature column corresponding to one word or one punctuation mark included in the second tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column; and
a test unit, configured to test the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
The beneficial effects of the embodiments of the present invention are as follows:
The method for mining the hypernym-hyponym relation between domain-specific terms of the embodiments of the present invention exploits the characteristic that the entry explanation sentence of a domain term on a dictionary page usually contains hypernym-hyponym relations between domain terms. Taking the entry explanation sentences in which the domain terms acquired from dictionary pages are located as the basis, and combining them with the hypernym-hyponym relation model file generated by the CRF tool, the method acquires the hypernym-hyponym relations between domain terms, breaks the pattern of single features and simple algorithms found in existing methods for mining hypernym-hyponym relations between ontology domain terms, and improves the accuracy of hypernym-hyponym relation acquisition.
Description of the drawings
Fig. 1 shows a flowchart of the method for mining the hypernym-hyponym relation between domain-specific terms of an embodiment of the present invention;
Fig. 2 shows a structural block diagram of the device for mining the hypernym-hyponym relation between domain-specific terms of an embodiment of the present invention;
Fig. 3 shows a structural block diagram of the model building module of an embodiment of the present invention;
Fig. 4 shows a structural block diagram of the file generating unit of an embodiment of the present invention;
Fig. 5 shows a structural block diagram of the acquisition module of an embodiment of the present invention;
Fig. 6 shows a structural block diagram of the relation acquisition module of an embodiment of the present invention;
Fig. 7 shows a schematic diagram of the application principle of the device for mining the hypernym-hyponym relation between domain-specific terms of an embodiment of the present invention applied in a search engine.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the accompanying drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided so that the present disclosure can be understood more thoroughly, and so that the scope of the present disclosure can be fully conveyed to those skilled in the art.
Embodiment one
According to one aspect of the embodiments of the present invention, a method for mining the hypernym-hyponym relation between domain-specific terms is provided. The method first acquires, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located, the first domain terms being words semantically related to the first predetermined domain terms; then, by using a hypernym-hyponym relation model file generated in advance with a CRF tool, it acquires the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences.
Thus, the method of the embodiments of the present invention uses the entry explanation sentences of the domain terms on dictionary pages, performs training and learning with CRF machine-learning techniques, finally establishes the model file, and uses the model file to acquire the hypernym-hyponym relation between the domain terms and the words included in the entry explanation sentences, which improves the accuracy of hypernym-hyponym relation acquisition.
As shown in Fig. 1, the method includes:
Step S11: acquiring, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located.
The dictionary pages include Baidu Baike, Wikipedia and the like. The encyclopedia card part of a dictionary page is typically an explanation or description of the domain term associated with that page, and these explanations usually contain hypernym-hyponym relations between domain terms. For example, a Baidu Baike page shows: "El Niño, also known as the Christ Child phenomenon, is a noun used by the fishermen along the coast of Peru and Ecuador to refer to a kind of extreme climate phenomenon." This sentence contains the hypernym-hyponym relation between "El Niño" and "extreme climate phenomenon". The regularized way in which encyclopedia cards express such relations is therefore very helpful for mining hypernym-hyponym relations, and the method of the embodiments of the present invention takes the encyclopedia card information as the target corpus, that is, as the basis for mining hypernym-hyponym relations between domain terms.
The first predetermined domain terms are selected in advance according to the domain terms for which the hypernym-hyponym relations need to be determined. For example, five domain terms A1, A2, A3, A4 and A5 are selected in advance as the first predetermined domain terms, and the entry explanation sentences in which the words semantically related to A1~A5 are located are acquired on the Baidu Baike pages, wherein the words semantically related to A1~A5 are exactly the first domain terms.
Specifically, step S11 includes:
crawling, with a web crawler, the dictionary pages related to the first predetermined domain terms, and acquiring the entry explanation sentences in which the first predetermined domain terms are located on the dictionary pages;
acquiring, from the entry explanation sentences in which the first predetermined domain terms are located, the domain terms that carry hyperlinks; and
crawling, with the web crawler, the dictionary pages related to the domain terms that carry hyperlinks, and acquiring on those dictionary pages the entry explanation sentences in which the words semantically related to the domain terms that carry hyperlinks are located, until the preset crawl depth of the web crawler is reached.
For example, the Baidu Baike pages related to the domain terms A1~A5 are first crawled by the web crawler, and the entry explanation sentences in which the words semantically related to A1~A5 (for example, B1~B5) are located are acquired; then, the words that carry hyperlinks (for example, C1~C5) are acquired from the acquired entry explanation sentences; next, the Baidu Baike pages related to C1~C5 are further crawled by the web crawler, and the entry explanation sentences in which the words semantically related to C1~C5 are located are acquired on those pages, and so on, until the preset crawl depth of the web crawler is reached.
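The depth-limited crawl just described can be sketched as follows (illustrative only: `PAGES` stands in for the live encyclopedia, and the term names follow the A1/B1/C1 example in the text):

```python
from collections import deque

PAGES = {   # term -> (entry explanation sentence, hyperlinked terms in it)
    "A1": ("A1 is a kind of B1", ["B1"]),
    "B1": ("B1 is a kind of C1", ["C1"]),
    "C1": ("C1 is a kind of D1", ["D1"]),
    "D1": ("D1 is a kind of E1", ["E1"]),
}

def crawl(seed_terms, max_depth):
    """Breadth-first collection of explanation sentences up to max_depth."""
    sentences, seen = [], set()
    queue = deque((term, 0) for term in seed_terms)
    while queue:
        term, depth = queue.popleft()
        if term in seen or term not in PAGES or depth > max_depth:
            continue
        seen.add(term)
        sentence, links = PAGES[term]
        sentences.append(sentence)
        for link in links:            # follow hyperlinked domain terms
            queue.append((link, depth + 1))
    return sentences

print(crawl(["A1"], 2))   # D1's page lies beyond the preset crawl depth
```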
Before step S11, the method further includes:
generating the hypernym-hyponym relation model file by using the CRF tool.
Generating a model file with the CRF tool requires a training file and a template file. The training file requires a corpus, and in the method of the embodiments of the present invention the entry explanation sentences in which the domain terms acquired from the dictionary pages are located are used as the corpus.
Accordingly, the step of generating the hypernym-hyponym relation model file by using the CRF tool includes:
acquiring, according to a plurality of second predetermined domain terms, the entry explanation sentences on dictionary pages in which second domain terms are located, and saving them as training sentences, the second domain terms being words semantically related to the second predetermined domain terms;
generating a training file from the training sentences according to a preset CRF training file format; and
generating the hypernym-hyponym relation model file with the CRF tool according to the training file and a predetermined template file.
The second predetermined domain terms and the first predetermined domain terms described above are different domain terms. For example, suppose a domain term set includes the hundred words X1~X100. Eighty words can be selected from this set as the second predetermined domain terms, and according to these eighty words the entry explanation sentences in which the words semantically related to them are located are acquired on the dictionary pages, the acquired sentences being used as training sentences. The remaining twenty words in the set are used as the first predetermined domain terms, the entry explanation sentences related to these twenty words are acquired on the dictionary pages, and the acquired sentences are used as test sentences. Of course, the acquisition of the training sentences and the acquisition of the test sentences can be carried out simultaneously, or separately in the same manner.
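The 80/20 split described above can be sketched as follows (the term names X1~X100 are the placeholders used in the text; the fixed random seed is an illustrative choice, not from the patent):

```python
import random

terms = [f"X{i}" for i in range(1, 101)]   # the domain term set X1~X100

rng = random.Random(0)   # fixed seed so the split is reproducible
rng.shuffle(terms)
train_terms = terms[:80]   # second predetermined domain terms (training)
test_terms = terms[80:]    # first predetermined domain terms (testing)

print(len(train_terms), len(test_terms))   # 80 20
```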
The step of generating the training file from the training sentences according to the preset CRF training file format specifically includes:
First, word segmentation is performed on the training sentences and part-of-speech tagging is carried out to obtain first tagged sentences.
For example, a certain training sentence is: "Renewable resources refer to energy sources in nature that can be constantly regenerated and used sustainably, and that are characterized by being inexhaustible." After word segmentation and part-of-speech tagging, the tagged result is: "renewable resources/n are/v defined as/v in/p nature/n within/f can/v constantly/d regenerate/v ,/w sustainably/d use/v of/uj energy/n ,/w have/v inexhaustible/i ,/w never-ending/i of/uj characteristics/n ./w", wherein the tag after each slash represents the part of speech of the content immediately before that slash. Specifically, v represents a verb, p a preposition, n a noun, f a locative noun, d an adverb, w a punctuation mark, uj a structural auxiliary word, and i an idiom.
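For reference, the tag set explained above can be captured in a small lookup (the tags and their meanings are as given in the text; the helper name is illustrative):

```python
POS_TAGS = {
    "v": "verb",
    "p": "preposition",
    "n": "noun",
    "f": "locative noun",
    "d": "adverb",
    "w": "punctuation mark",
    "uj": "structural auxiliary word",
    "i": "idiom",
}

def explain_token(token):
    """Split a 'word/tag' token and spell out its part of speech."""
    word, _, tag = token.rpartition("/")
    return f"{word}: {POS_TAGS.get(tag, 'unknown tag')}"

print(explain_token("energy/n"))        # energy: noun
print(explain_token("constantly/d"))    # constantly: adverb
```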
Then, the training file to be tagged is generated from the first tagged sentences according to the preset CRF training file format; the training file to be tagged includes a plurality of tokens located on different rows, each token including a plurality of feature columns and a result output token column, the plurality of feature columns respectively representing different features. In an embodiment of the present invention, each token included in the training file includes five columns, which respectively represent the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column. The first tagged sentences obtained in the previous step are already tagged with parts of speech; therefore, each word and punctuation mark in a first tagged sentence can be written into the word feature column of a different token, and the part of speech of the corresponding word or punctuation mark can be written into the corresponding part-of-speech feature column. In the training file, the columns included in each token are separated by spaces or tab characters, a sequence of tokens constitutes one sentence, and sentences are separated by blank lines. Finally, the features not yet tagged in the training file to be tagged need to be tagged: the feature-word dictionary feature and the colon-information feature are tagged, and the hypernym-hyponym relation is tagged on the result output token column, thereby obtaining the training file. The result after finally tagging the training file to be tagged is shown in Table 1.
The specific annotation process is as follows:
Feature words are extracted from the first tagged sentences and stored in a feature-word dictionary.
According to the feature-word dictionary, it is judged whether the content of the word feature column of each token in the to-be-annotated training file is a feature word; if so, 1 is marked in the feature-word dictionary feature column, otherwise 0 is marked.
It is judged whether the content of the word feature column of each token in the to-be-annotated training file is a colon; if so, 1 is marked in the pointing-information feature column, otherwise 0 is marked.
It is judged whether the contents of the word feature columns of a plurality of tokens in the to-be-annotated training file are words having a hypernym-hyponym relation; if so, the result output label column of the word that is the superordinate concept is marked U and the result output label column of the word that is the subordinate concept is marked D; if not, the result output label column of the token is marked 0.
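The five-column row construction can be sketched as below. This is an illustrative sketch, not the patent's implementation: the U/D decisions would come from the human annotator and are passed in as sets here, and the field-term handling (rewriting the POS column to S) anticipates the remark that follows.

```python
# Sketch of building the five-column token rows of the training file.
# Inputs: (word, pos) pairs, a feature-word dictionary, the sets of words
# the annotator marked as superordinate (U) / subordinate (D) concepts,
# and the set of field terms (whose POS column is rewritten to S).
# All data and names here are illustrative.

def build_rows(tagged, feature_dict, hypernyms, hyponyms, field_terms):
    rows = []
    for word, pos in tagged:
        pos_col = "S" if word in field_terms else pos
        dict_col = 1 if word in feature_dict else 0
        colon_col = 1 if word == ":" else 0   # pointing-information feature
        if word in hypernyms:
            label = "U"
        elif word in hyponyms:
            label = "D"
        else:
            label = "0"
        rows.append((word, pos_col, dict_col, colon_col, label))
    return rows

rows = build_rows(
    tagged=[("Renewable resource", "n"), ("is", "v"), ("energy", "n")],
    feature_dict={"is", "refers to"},
    hypernyms={"energy"}, hyponyms={"Renewable resource"},
    field_terms={"Renewable resource", "energy"})
for r in rows:
    print("\t".join(str(c) for c in r))
```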
In addition, in order to see more clearly which word feature columns in the training file contain field terms, the mark in the part-of-speech feature column of any word that is a field term may be changed to S, which facilitates locating, from the marks in the result output label column, the hypernym-hyponym relations between field terms.
Table 1. Final annotation example of the training file
Word feature | Part-of-speech feature | Feature-word dictionary feature | Pointing-information feature | Result output label column |
Renewable resource | S | 0 | 0 | D |
It is | v | 1 | 0 | 0 |
Refer to | v | 1 | 0 | 0 |
In | p | 1 | 0 | 0 |
Nature | n | 0 | 0 | 0 |
In | f | 1 | 0 | 0 |
Can be with | v | 1 | 0 | 0 |
Constantly | d | 0 | 0 | 0 |
Regeneration | v | 0 | 0 | 0 |
、 | w | 0 | 0 | 0 |
Continue forever | d | 0 | 0 | 0 |
Utilize | v | 0 | 0 | 0 |
's | uj | 1 | 0 | 0 |
The energy | n | 0 | 0 | U |
, | w | 0 | 0 | 0 |
Have | v | 1 | 0 | 0 |
Inexhaustible | i | 0 | 0 | 0 |
, | w | 0 | 0 | 0 |
Unfailing | i | 0 | 0 | 0 |
's | uj | 1 | 0 | 0 |
Feature | n | 0 | 0 | 0 |
。 | w | 0 | 0 | 0 |
As can be seen from the above, the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention can, in a certain sense, be regarded as a relation classification process: the relation between any two words in a given text is classified, according to the object and target of study, as either a hypernym-hyponym relation or a non-hypernym-hyponym relation. The target relation mining problem is thereby converted into a relation classification problem, which simplifies the problem.
In addition, statistics show that the linguistic organization of dictionary pages, such as encyclopedia summary cards, contains a number of fixed, regular patterns, and the words constituting these patterns are strongly indicative and instructive for the mining of hypernym-hyponym relations. In the noun explanations of encyclopedia entries, "feature words" with an evident referring function, such as "is a kind of" or "is a", often appear near field terms having a hypernym-hyponym relation. Summarization shows that these feature words occur frequently in the explanations of encyclopedia entries and indicate a certain semantic relation; their meaning is abstract, so they are usually not content words such as nouns or adjectives, but verbs, adverbs, prepositions and other words that express abstract concepts unrelated to the field.
Therefore, the above-mentioned feature words can be obtained by filtering out the start/stop marker words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first tagged sentences, and taking the remaining words in the first tagged sentences as candidate feature words. Then, the word frequency of each candidate feature word is counted, and the feature words are selected, according to the semantic function of the candidate feature words, from the candidate feature words whose word frequency exceeds a predetermined threshold, wherein the word frequency is the number of times a candidate feature word occurs in the first tagged sentences.
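The candidate feature word selection can be sketched as below. The POS codes to filter and the threshold value are illustrative assumptions; the patent specifies the word classes to filter but not the concrete tag codes or threshold.

```python
from collections import Counter

# Sketch of selecting candidate feature words: drop tokens whose POS marks
# them as filtered classes (noun n, adjective a, English eng, time t,
# numeral m, punctuation w -- illustrative tag codes), keep the rest as
# candidates, then keep those whose frequency across all first tagged
# sentences exceeds a threshold.

FILTERED_POS = {"n", "a", "eng", "t", "m", "w"}

def candidate_feature_words(tagged_sentences, threshold):
    counts = Counter(word
                     for sent in tagged_sentences
                     for word, pos in sent
                     if pos not in FILTERED_POS)
    return {w for w, c in counts.items() if c > threshold}

sents = [[("is", "v"), ("a kind of", "v"), ("energy", "n")],
         [("is", "v"), ("a kind of", "v"), ("resource", "n")],
         [("is", "v"), ("water", "n")]]
print(candidate_feature_words(sents, threshold=1))
# -> {'is', 'a kind of'} (order may vary)
```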
In addition, for the template file, atomic features and combined features may be selected in advance, feature templates determined according to the atomic features and combined features, and a plurality of feature templates assembled into the template file. A suitable feature set is thus selected for the generation of the model file, so that complex linguistic phenomena are represented by simple features and their underlying regularities are captured.
Statistics also show that punctuation marks help express semantic information: if two terms in a sentence are separated by a colon, there is a very high probability that a hypernym-hyponym relation exists between them. Therefore, in the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention, the pointing information is used, together with the word, the part of speech and the feature-word dictionary, as the atomic features of the model. Meanwhile, in order to fully consider the influence of context, combined features are introduced through a sliding window whose size is set, for example, to two units; that is, taking the current term as the reference, the window slides two unit distances forward and backward, where a unit distance here refers to a row in the training file.
The feature templates may be as shown in Table 2. As shown in Table 2, the basic format of a feature template is %x[row, col], where row specifies the row number relative to the current token, i.e. the size of the sliding distance, and col specifies the absolute column number, corresponding to a feature in the input data sequence. Because both atomic features and combined features are added to the feature templates, the context can be fully taken into account and the limitation of the independence assumption is escaped, so that a better tagging effect can be obtained; the customization of the feature templates, however, requires careful tuning.
Table 2. Feature templates
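A template file in the %x[row, col] format can be generated mechanically. The sketch below assumes a CRF++-style unigram template syntax, a window of two rows in each direction, and four feature columns (0 word, 1 part of speech, 2 feature-word dictionary, 3 pointing information); the exact template set used in the embodiment is not reproduced in the text.

```python
# Sketch: emit unigram feature templates in the CRF++-style %x[row,col]
# notation for a sliding window of +/-2 rows over 4 feature columns
# (0 word, 1 POS, 2 feature-word dictionary, 3 pointing information).
# The window size and column layout follow the description in the text;
# the template naming scheme (U00, U01, ...) is an assumption.

def make_templates(window=2, columns=4):
    templates = []
    i = 0
    for col in range(columns):
        for row in range(-window, window + 1):
            templates.append(f"U{i:02d}:%x[{row},{col}]")
            i += 1
    return templates

tpl = make_templates()
print(len(tpl))          # 4 columns * 5 window positions = 20 templates
print(tpl[0], tpl[-1])   # U00:%x[-2,0] U19:%x[2,3]
```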
After the training file and the template file are finally obtained, the word feature, part-of-speech feature, feature-word dictionary feature, pointing-information feature and result output label column of each token in the training file are further input into the CRF tool according to the plurality of feature templates included in the template file, and the hypernym-hyponym relation model file is output from the CRF tool.
Each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool. The sliding distance is the number of unit rows slid forward or backward in the training file, taking the current feature as the reference; the column position information covers the word feature, the part-of-speech feature, the feature-word dictionary feature, the pointing-information feature and the result output label column. That is, a feature template expresses, within the training file, the coordinate position of a feature of another token relative to a feature of the current token taken as the origin.
In summary, in the CRF tool, the training process that generates the hypernym-hyponym relation model file can be summarized simply in terms of input and output. The input is the word-segmented and annotated corpus; the output is the hypernym-hyponym relation model file, which consists of a model composed of a set of feature functions together with a set of parameters. The trained hypernym-hyponym relation model file is then used to make predictions on the collected data, outputting a set of terms with hypernym-hyponym relations.
In step S13, the hypernym-hyponym relation model file generated in advance by the CRF tool is used to obtain the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
Specifically, step S13 comprises:
performing word segmentation and part-of-speech tagging on the entry explanation sentences to obtain second tagged sentences;
generating a test file from the second tagged sentences according to a predetermined CRF test file format, wherein the test file comprises a plurality of tokens, each token comprises two columns which respectively represent a word feature and a part-of-speech feature, the word feature column corresponds to a word or punctuation mark included in the second tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
testing the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
Therefore, after the collection of the entry explanation sentences in which the first field terms are located is completed in step S11, these sentences further need to be word-segmented and part-of-speech tagged to generate the test file, and the test file is then tested by the hypernym-hyponym relation model file, so as to obtain the required hypernym-hyponym relations.
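Reading the hypernym-hyponym pairs back out of the model's predicted label sequence can be sketched as below. The pairing policy (combining every D-labelled word with every U-labelled word of the same sentence) is an illustrative assumption; the patent only states that U marks the superordinate concept and D the subordinate one.

```python
# Sketch: recover (hyponym, hypernym) pairs from the label sequence that
# the model predicts for one test sentence. Pairing every D-labelled word
# with every U-labelled word in the sentence is an illustrative policy.

def extract_pairs(labelled_tokens):
    hypernyms = [w for w, lab in labelled_tokens if lab == "U"]
    hyponyms = [w for w, lab in labelled_tokens if lab == "D"]
    return [(hypo, hyper) for hypo in hyponyms for hyper in hypernyms]

sentence = [("Renewable resource", "D"), ("is", "0"), ("a kind of", "0"),
            ("energy", "U")]
print(extract_pairs(sentence))
# -> [('Renewable resource', 'energy')]
```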
In summary, the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention performs relation mining on field terms with a semi-supervised machine learning method, which greatly saves labor cost compared with existing dictionary-based and rule-based methods. The method of the embodiment of the present invention no longer relies entirely on the experience and knowledge of domain experts, but obtains the hypernym-hyponym relations between field terms by machine learning over unstructured web page data. It breaks the pattern of single features and simple algorithms in existing methods for mining hypernym-hyponym relations between field terms in ontology construction, and improves the accuracy of hypernym-hyponym relation acquisition.
In addition, the hypernym-hyponym relations between field terms obtained by the method of the embodiment of the present invention can be applied in conventional search engines or question answering systems. Conventional search engines and question answering systems generally use keyword-based matching techniques and do not make full use of the association relations between search entities, which leads to few hits or empty result sets. With the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention, query expansion based on hypernym-hyponym relations can be performed during a user's search or question answering process, which on the one hand increases the recall rate of the system, and on the other hand proactively recommends more and broader related information to the user, enriching the user experience and enhancing user stickiness.
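The query expansion step can be sketched as below, under the assumption that the mined relation set is stored as a mapping from each hypernym to its hyponyms; the data are illustrative.

```python
# Sketch: query expansion with mined hypernym-hyponym pairs. The mined
# relation set maps each hypernym to its hyponyms; expanding a query adds
# the hyponyms of any query term, increasing recall. Data are illustrative.

def expand_query(query_terms, hyponyms_of):
    expanded = list(query_terms)
    for term in query_terms:
        for hypo in hyponyms_of.get(term, ()):
            if hypo not in expanded:
                expanded.append(hypo)
    return expanded

relations = {"energy": ["renewable resource", "fossil fuel"]}
print(expand_query(["energy"], relations))
# -> ['energy', 'renewable resource', 'fossil fuel']
```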
Embodiment two
According to another aspect of the embodiments of the present invention, a device for mining hypernym-hyponym relations between field terms is further provided. As shown in Fig. 2, the device 200 comprises:
an acquisition module 201, configured to gather, according to a plurality of first predetermined field terms, the entry explanation sentences in which first field terms are located on the dictionary pages, the first field terms being words semantically related to the first predetermined field terms;
a relation acquisition module 205, configured to obtain, by using a hypernym-hyponym relation model file generated in advance by the CRF tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
Optionally, as shown in Fig. 2, the device further comprises:
a model building module 203, configured to generate the hypernym-hyponym relation model file by using the CRF tool.
Optionally, as shown in Fig. 3, the model building module 203 comprises:
a collecting unit 2031, configured to gather, according to a plurality of second predetermined field terms, the entry explanation sentences in which second field terms are located on the dictionary pages, and save them as training sentences, the second field terms being words semantically related to the second predetermined field terms;
a file generating unit 2032, configured to generate a training file from the training sentences according to a predetermined CRF training file format;
a relation acquisition unit 2033, configured to generate the hypernym-hyponym relation model file by using the CRF tool according to the training file and a predetermined template file.
Optionally, as shown in Fig. 4, the file generating unit 2032 comprises:
a first tagging subunit 20321, configured to perform word segmentation and part-of-speech tagging on the training sentences to obtain first tagged sentences;
a second tagging subunit 20322, configured to generate a to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotate the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generate the training file.
Optionally, the second tagging subunit 20322 is specifically configured to:
generate the to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, wherein the to-be-annotated training file comprises a plurality of tokens located on different rows, each token comprises a plurality of feature columns and a result output label column, the feature columns respectively represent different features including a word feature, a part-of-speech feature, a feature-word dictionary feature and a pointing-information feature, the word feature column corresponds to a word or punctuation mark included in the first tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
extract feature words from the first tagged sentences and store them in a feature-word dictionary;
judge, according to the feature-word dictionary, whether the content of the word feature column of each token in the to-be-annotated training file is a feature word, and if so, mark 1 in the feature-word dictionary feature column, otherwise mark 0;
judge whether the content of the word feature column of each token in the to-be-annotated training file is a colon, and if so, mark 1 in the pointing-information feature column, otherwise mark 0;
judge whether the contents of the word feature columns of a plurality of tokens in the to-be-annotated training file are words having a hypernym-hyponym relation, and if so, mark U in the result output label column of the word that is the superordinate concept and mark D in the result output label column of the word that is the subordinate concept; otherwise, mark 0 in the result output label column of the token.
Optionally, the feature words are obtained by filtering out the start/stop marker words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first tagged sentences and taking the remaining words in the first tagged sentences as candidate feature words; the word frequency of each candidate feature word is then counted, and the feature words are selected, according to the semantic function of the candidate feature words, from the candidate feature words whose word frequency exceeds a predetermined threshold, wherein the word frequency is the number of times a candidate feature word occurs in the first tagged sentences.
Optionally, the relation acquisition unit is specifically configured to:
input, according to the plurality of feature templates included in the template file, the word feature, part-of-speech feature, feature-word dictionary feature, pointing-information feature and result output label column of each token in the training file into the CRF tool, and output the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance is the number of unit rows slid forward or backward in the training file taking the current feature as the reference, and the column position information covers the word feature, the part-of-speech feature, the feature-word dictionary feature, the pointing-information feature and the result output label column.
Optionally, as shown in Fig. 5, the acquisition module 201 comprises:
a first collecting unit 2011, configured to crawl, by a web crawler, the dictionary pages related to the first predetermined field terms, and obtain the entry explanation sentences in which the first predetermined field terms are located on the dictionary pages;
a first obtaining unit 2012, configured to obtain, from the entry explanation sentences in which the first predetermined field terms are located, the field terms carrying hyperlinks;
a second collecting unit 2013, configured to crawl, by the web crawler, the dictionary pages related to the field terms carrying hyperlinks, and obtain the entry explanation sentences in which the words semantically related to the field terms carrying hyperlinks are located on those dictionary pages, until the predetermined crawl depth of the web crawler is reached.
Optionally, as shown in Fig. 6, the relation acquisition module 205 comprises:
a word segmentation unit 2051, configured to perform word segmentation and part-of-speech tagging on the entry explanation sentences to obtain second tagged sentences;
a format conversion unit 2052, configured to generate a test file from the second tagged sentences according to a predetermined CRF test file format, wherein the test file comprises a plurality of tokens, each token comprises two columns which respectively represent a word feature and a part-of-speech feature, the word feature column corresponds to a word or punctuation mark included in the second tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
a test unit 2053, configured to test the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
The device for mining hypernym-hyponym relations between field terms of the embodiment of the present invention may be applied in the query expansion stage of search engines and automatic question answering. For example, as shown in Fig. 7, when applied in a search engine, it is only necessary to predetermine the field terms of a number of related fields as a seed term set and input them into the device; the device then outputs a set of terms with hypernym-hyponym relations, which is saved in the search engine, so that after a user inputs a search keyword, the search engine can use the hypernym-hyponym relations between field terms to recommend more and broader related information to the user, enriching the user experience and enhancing user stickiness.
Described above are preferred embodiments of the present invention. It should be pointed out that a person of ordinary skill in the art may further make improvements and modifications without departing from the principle of the present invention, and these improvements and modifications also fall within the protection scope of the present invention.
Claims (18)
1. A method for mining hypernym-hyponym relations between field terms, characterized by comprising:
gathering, according to a plurality of first predetermined field terms, entry explanation sentences in which first field terms are located on dictionary pages, the first field terms being words semantically related to the first predetermined field terms;
obtaining, by using a hypernym-hyponym relation model file generated in advance by a conditional random field (CRF) tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
2. The method as claimed in claim 1, characterized in that before the step of gathering, according to the plurality of first predetermined field terms, the entry explanation sentences in which the first field terms are located on the dictionary pages, the method further comprises:
generating the hypernym-hyponym relation model file by using the CRF tool.
3. The method as claimed in claim 2, characterized in that the step of generating the hypernym-hyponym relation model file by using the CRF tool comprises:
gathering, according to a plurality of second predetermined field terms, entry explanation sentences in which second field terms are located on the dictionary pages, and saving them as training sentences, the second field terms being words semantically related to the second predetermined field terms;
generating a training file from the training sentences according to a predetermined CRF training file format;
generating the hypernym-hyponym relation model file by using the CRF tool according to the training file and a predetermined template file.
4. The method as claimed in claim 3, characterized in that the step of generating the training file from the training sentences according to the predetermined CRF training file format comprises:
performing word segmentation and part-of-speech tagging on the training sentences to obtain first tagged sentences;
generating a to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotating the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generating the training file.
5. The method as claimed in claim 4, characterized in that the step of generating the to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotating the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generating the training file comprises:
generating the to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, wherein the to-be-annotated training file comprises a plurality of tokens located on different rows, each token comprises a plurality of feature columns and a result output label column, the feature columns respectively represent different features including a word feature, a part-of-speech feature, a feature-word dictionary feature and a pointing-information feature, the word feature column corresponds to a word or punctuation mark included in the first tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
extracting feature words from the first tagged sentences and storing them in a feature-word dictionary;
judging, according to the feature-word dictionary, whether the content of the word feature column of each token in the to-be-annotated training file is a feature word, and if so, marking 1 in the feature-word dictionary feature column, otherwise marking 0;
judging whether the content of the word feature column of each token in the to-be-annotated training file is a colon, and if so, marking 1 in the pointing-information feature column, otherwise marking 0;
judging whether the contents of the word feature columns of a plurality of tokens in the to-be-annotated training file are words having a hypernym-hyponym relation, and if so, marking U in the result output label column of the word that is the superordinate concept and marking D in the result output label column of the word that is the subordinate concept; otherwise, marking 0 in the result output label column of the token.
6. The method as claimed in claim 5, characterized in that the step of extracting the feature words from the first tagged sentences and storing them in the feature-word dictionary comprises:
filtering out the start/stop marker words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first tagged sentences, and taking the remaining words in the first tagged sentences as candidate feature words;
counting the word frequency of each candidate feature word, the word frequency being the number of times the candidate feature word occurs in the first tagged sentences;
selecting, according to the semantic function of the candidate feature words, the feature words from the candidate feature words whose word frequency exceeds a predetermined threshold, and storing them in the feature-word dictionary.
7. The method as claimed in claim 5, characterized in that the step of generating the hypernym-hyponym relation model file by using the CRF tool according to the training file and the template file comprises:
inputting, according to the plurality of feature templates included in the template file, the word feature, part-of-speech feature, feature-word dictionary feature, pointing-information feature and result output label column of each token in the training file into the CRF tool, and outputting the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance is the number of unit rows slid forward or backward in the training file taking the current feature as the reference, and the column position information covers the word feature, the part-of-speech feature, the feature-word dictionary feature, the pointing-information feature and the result output label column.
8. The method as claimed in claim 1, characterized in that the step of gathering, according to the plurality of first predetermined field terms, the entry explanation sentences in which the first field terms are located on the dictionary pages comprises:
crawling, by a web crawler, the dictionary pages related to the first predetermined field terms, and obtaining the entry explanation sentences in which the first predetermined field terms are located on the dictionary pages;
obtaining, from the entry explanation sentences in which the first predetermined field terms are located, the field terms carrying hyperlinks;
crawling, by the web crawler, the dictionary pages related to the field terms carrying hyperlinks, and obtaining the entry explanation sentences in which the words semantically related to the field terms carrying hyperlinks are located on those dictionary pages, until the predetermined crawl depth of the web crawler is reached.
9. The method as claimed in claim 1, characterized in that the step of obtaining, by using the hypernym-hyponym relation model file generated in advance by the conditional random field (CRF) tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences comprises:
performing word segmentation and part-of-speech tagging on the entry explanation sentences to obtain second tagged sentences;
generating a test file from the second tagged sentences according to a predetermined CRF test file format, wherein the test file comprises a plurality of tokens, each token comprises two columns which respectively represent a word feature and a part-of-speech feature, the word feature column corresponds to a word or punctuation mark included in the second tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
testing the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
10. A device for mining hypernym-hyponym relations between field terms, characterized by comprising:
an acquisition module, configured to gather, according to a plurality of first predetermined field terms, entry explanation sentences in which first field terms are located on dictionary pages, the first field terms being words semantically related to the first predetermined field terms;
a relation acquisition module, configured to obtain, by using a hypernym-hyponym relation model file generated in advance by a conditional random field (CRF) tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
11. The device as claimed in claim 10, characterized in that the device further comprises:
a model building module, configured to generate the hypernym-hyponym relation model file by using the CRF tool.
12. The device as claimed in claim 11, characterized in that the model building module comprises:
a collecting unit, configured to gather, according to a plurality of second predetermined field terms, entry explanation sentences in which second field terms are located on the dictionary pages, and save them as training sentences, the second field terms being words semantically related to the second predetermined field terms;
a file generating unit, configured to generate a training file from the training sentences according to a predetermined CRF training file format;
a relation acquisition unit, configured to generate the hypernym-hyponym relation model file by using the CRF tool according to the training file and a predetermined template file.
13. The device as claimed in claim 12, characterized in that the file generating unit comprises:
a first tagging subunit, configured to perform word segmentation and part-of-speech tagging on the training sentences to obtain first tagged sentences;
a second tagging subunit, configured to generate a to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotate the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generate the training file.
14. The device according to claim 13, characterized in that the second labeling subunit is specifically configured to:
generate the training file to be labeled from the first labeled sentences according to the preset CRF training file format, wherein the training file to be labeled comprises a plurality of tokens located on different lines, each token comprising a plurality of feature columns and a result-output label column, the feature columns respectively representing different features, the features comprising a word feature, a part-of-speech feature, a feature-word dictionary feature and a punctuation-information feature, the word feature column corresponding to one word or one punctuation mark included in the first labeled sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
extract feature words from the first labeled sentences and store them in a feature-word dictionary;
judge, according to the feature-word dictionary, whether the content of the word feature column of each token in the training file to be labeled is a feature word; if so, mark 1 in the feature-word dictionary feature column, otherwise mark 0;
judge whether the content of the word feature column of each token in the training file to be labeled is a colon; if so, mark 1 in the punctuation-information feature column, otherwise mark 0;
judge whether the contents of the word feature columns of a plurality of tokens in the training file to be labeled are words having a hypernym-hyponym relation; if so, mark U in the result-output label column corresponding to the word belonging to the superordinate concept and mark D in the result-output label column corresponding to the word belonging to the subordinate concept; otherwise, mark 0 in the result-output label column of the token.
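The per-token labeling of claim 14 (feature-word dictionary flag, colon flag, and the U/D/0 output label) can be sketched as below. The function and argument names are hypothetical; the known hypernym/hyponym sets stand in for whatever supervision the training data provides.

```python
def label_tokens(tokens, feature_dict, hypernyms, hyponyms):
    """Build the feature columns described in claim 14 for one segmented,
    POS-tagged sentence. `tokens` is a list of (word, pos) pairs;
    `feature_dict` is the feature-word dictionary; `hypernyms`/`hyponyms`
    hold the known related words used to fill the result-output label column."""
    rows = []
    for word, pos in tokens:
        dict_flag = "1" if word in feature_dict else "0"      # feature-word dictionary feature
        colon_flag = "1" if word in ("：", ":") else "0"       # punctuation-information feature
        if word in hypernyms:
            label = "U"  # superordinate concept
        elif word in hyponyms:
            label = "D"  # subordinate concept
        else:
            label = "0"
        rows.append((word, pos, dict_flag, colon_flag, label))
    return rows

# "猫 是 一种 动物 。" (a cat is a kind of animal):
rows = label_tokens(
    [("猫", "n"), ("是", "v"), ("一种", "m"), ("动物", "n"), ("。", "w")],
    feature_dict={"是", "一种"},
    hypernyms={"动物"},
    hyponyms={"猫"},
)
```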
15. The device according to claim 14, characterized in that the feature words are obtained by filtering out the start and stop tag words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first labeled sentences, taking the remaining words of the first labeled sentences as candidate feature words, then counting the word frequency of each candidate feature word and, according to the semantic function of the candidate feature words, selecting from the candidate feature words whose word frequency exceeds a predetermined threshold, wherein the word frequency is the number of times a candidate feature word occurs in the first labeled sentences.
16. The device according to claim 14, characterized in that the relation acquisition unit is specifically configured to:
input, according to a plurality of feature templates included in the template file, the word feature, the part-of-speech feature, the feature-word dictionary feature, the punctuation-information feature and the result-output label column of each token in the training file into the CRF tool, and output the hypernym-hyponym relation model file from the CRF tool, wherein the feature templates carry the sliding distance and the column position information of the features input into the CRF tool, the sliding distance being the number of lines slid forward or backward in the training file with the current feature as the reference, and the column position information indicating the word feature, the part-of-speech feature, the feature-word dictionary feature, the punctuation-information feature or the result-output label column.
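The "sliding distance and column position information" of claim 16 correspond directly to the row and column offsets of a CRF++-style template file. The fragment below is an illustrative template, not one disclosed in the patent: in `%x[r,c]`, `r` is the sliding distance (lines forward or backward from the current token) and `c` the column position (here 0 = word, 1 = part of speech, 2 = feature-word dictionary flag, 3 = punctuation flag).

```text
# Unigram templates over the word column: previous, current and next token
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
# Part-of-speech window
U03:%x[-1,1]
U04:%x[0,1]
U05:%x[1,1]
# Feature-word dictionary flag and colon flag of the current token
U06:%x[0,2]
U07:%x[0,3]
# Bigram template: condition on the previous output label as well
B
```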
17. The device according to claim 10, characterized in that the acquisition module comprises:
a first collecting unit, configured to crawl, by a web crawler, the word bank page related to the first predetermined domain-specific term, and to obtain the entry explanation sentence on the word bank page in which the first predetermined domain-specific term is located;
a first acquisition unit, configured to obtain, from the entry explanation sentence in which the first predetermined domain-specific term is located, the domain-specific terms carrying hyperlinks;
a second collecting unit, configured to crawl, by the web crawler, the word bank pages related to the domain-specific terms carrying hyperlinks, and to obtain from those pages the entry explanation sentences in which the words semantically related to the hyperlinked domain-specific terms are located, until a preset crawl depth of the web crawler is reached.
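The depth-limited crawl of claim 17 amounts to a breadth-first traversal that stops expanding links once the preset depth is reached. The sketch below replaces real HTTP fetching with an injected `fetch_entry` function so the traversal logic stands alone; all names and the fake lexicon are illustrative.

```python
from collections import deque

def crawl_entries(seed_term, fetch_entry, max_depth):
    """Breadth-first crawl of lexicon pages. `fetch_entry` stands in for the
    web crawler: given a term, it returns (entry explanation sentence,
    hyperlinked terms found in that sentence). Links found at `max_depth`
    are not followed further."""
    sentences = {}
    seen = {seed_term}
    queue = deque([(seed_term, 0)])
    while queue:
        term, depth = queue.popleft()
        sentence, linked_terms = fetch_entry(term)
        sentences[term] = sentence
        if depth < max_depth:
            for linked in linked_terms:
                if linked not in seen:
                    seen.add(linked)
                    queue.append((linked, depth + 1))
    return sentences

# A fake lexicon standing in for crawled word bank pages.
FAKE_PAGES = {
    "machine learning": ("machine learning is a branch of artificial intelligence",
                         ["artificial intelligence"]),
    "artificial intelligence": ("artificial intelligence is a field of computer science",
                                ["computer science"]),
    "computer science": ("computer science is the study of computation", []),
}
result = crawl_entries("machine learning", FAKE_PAGES.get, max_depth=1)
```

With `max_depth=1` the crawl visits the seed page and the pages it links to, but does not follow links found on those second-level pages.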
18. The device according to claim 10, characterized in that the relation acquisition module comprises:
a word segmentation unit, configured to perform word segmentation and part-of-speech tagging on the entry explanation sentence to obtain a second labeled sentence;
a format conversion unit, configured to generate a test file from the second labeled sentence according to a preset CRF test file format, the test file comprising a plurality of tokens, each token comprising two columns which respectively represent the word feature and the part-of-speech feature, the word feature column corresponding to one word or one punctuation mark included in the second labeled sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
a test unit, configured to test the test file using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relation between the words included in the test file.
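Once the CRF tool has tagged a test sentence with the U/D/0 labels of claim 14, the final step of claim 18 is reading relation pairs back out of the tags. A minimal decoding sketch, with hypothetical names; pairing every D-word with every U-word in the sentence is one simple policy for the single-label scheme, not necessarily the patent's:

```python
def extract_pairs(tagged_tokens):
    """Turn the CRF tool's tagged output for one entry explanation sentence
    into (hypernym, hyponym) pairs: words tagged U are superordinate
    concepts, words tagged D subordinate ones."""
    hypernyms = [word for word, tag in tagged_tokens if tag == "U"]
    hyponyms = [word for word, tag in tagged_tokens if tag == "D"]
    return [(hyper, hypo) for hyper in hypernyms for hypo in hyponyms]

# Tagged output for "猫是一种动物。" (a cat is a kind of animal):
pairs = extract_pairs(
    [("猫", "D"), ("是", "0"), ("一种", "0"), ("动物", "U"), ("。", "0")]
)
```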
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510652163.2A CN106569993A (en) | 2015-10-10 | 2015-10-10 | Method and device for mining hypernym-hyponym relation between domain-specific terms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106569993A true CN106569993A (en) | 2017-04-19 |
Family
ID=58506722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510652163.2A Pending CN106569993A (en) | 2015-10-10 | 2015-10-10 | Method and device for mining hypernym-hyponym relation between domain-specific terms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106569993A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | 昆明理工大学 | Text classification feature selection and weight computation method based on domain knowledge |
CN101710343A (en) * | 2009-12-11 | 2010-05-19 | 北京中机科海科技发展有限公司 | Automatic ontology construction system and method based on text mining |
CN102117281A (en) * | 2009-12-30 | 2011-07-06 | 北京亿维讯科技有限公司 | Method for constructing domain ontology |
CN102360383A (en) * | 2011-10-15 | 2012-02-22 | 西安交通大学 | Method for extracting domain terms and term relations from text |
CN103699568A (en) * | 2013-11-16 | 2014-04-02 | 西安交通大学城市学院 | Method for extracting hypernym-hyponym relations of domain terms from Wikipedia |
Non-Patent Citations (1)
Title |
---|
HUANG, Yi et al.: "A method for acquiring hypernym-hyponym relations between domain terms based on conditional random fields", Journal of Central South University (Science and Technology) * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108733702A (en) * | 2017-04-20 | 2018-11-02 | 北京京东尚科信息技术有限公司 | Method, apparatus, electronic device and medium for extracting hypernym-hyponym relations from user queries |
CN108733702B (en) * | 2017-04-20 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Method, device, electronic equipment and medium for extracting upper and lower relation of user query |
CN107357776A (en) * | 2017-06-16 | 2017-11-17 | 北京奇艺世纪科技有限公司 | Related-term mining method and device |
CN108563617A (en) * | 2018-03-12 | 2018-09-21 | 北京云知声信息技术有限公司 | Mining method and device for Chinese sentence hybrid templates |
CN110674306A (en) * | 2018-06-15 | 2020-01-10 | 株式会社日立制作所 | Knowledge graph construction method and device and electronic equipment |
CN110674306B (en) * | 2018-06-15 | 2023-06-20 | 株式会社日立制作所 | Knowledge graph construction method and device and electronic equipment |
US11556570B2 (en) | 2018-09-20 | 2023-01-17 | International Business Machines Corporation | Extraction of semantic relation |
CN109800308B (en) * | 2019-01-22 | 2022-04-15 | 四川长虹电器股份有限公司 | Short text classification method based on part-of-speech and fuzzy pattern recognition combination |
CN109800308A (en) * | 2019-01-22 | 2019-05-24 | 四川长虹电器股份有限公司 | Short text classification method combining part of speech and fuzzy pattern recognition |
CN109933691A (en) * | 2019-02-11 | 2019-06-25 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and storage medium for content retrieval |
CN110209822A (en) * | 2019-06-11 | 2019-09-06 | 中译语通科技股份有限公司 | Deep-learning-based method and computer for predicting relations in academic-domain data |
CN110196982B (en) * | 2019-06-12 | 2022-12-27 | 腾讯科技(深圳)有限公司 | Method and device for extracting upper-lower relation and computer equipment |
CN110196982A (en) * | 2019-06-12 | 2019-09-03 | 腾讯科技(深圳)有限公司 | Hypernym-hyponym relation extraction method, device and computer equipment |
CN110263342A (en) * | 2019-06-20 | 2019-09-20 | 北京百度网讯科技有限公司 | Method and device for mining hypernym-hyponym relations of entities, and electronic device |
CN110334186A (en) * | 2019-07-08 | 2019-10-15 | 北京三快在线科技有限公司 | Data query method, apparatus, computer equipment and computer readable storage medium |
CN110334186B (en) * | 2019-07-08 | 2021-09-28 | 北京三快在线科技有限公司 | Data query method and device, computer equipment and computer readable storage medium |
GB2602762A (en) * | 2019-09-18 | 2022-07-13 | Ibm | Hypernym detection using strict partial order networks |
US11068665B2 (en) | 2019-09-18 | 2021-07-20 | International Business Machines Corporation | Hypernym detection using strict partial order networks |
WO2021053511A1 (en) * | 2019-09-18 | 2021-03-25 | International Business Machines Corporation | Hypernym detection using strict partial order networks |
US11694035B2 (en) | 2019-09-18 | 2023-07-04 | International Business Machines Corporation | Hypernym detection using strict partial order networks |
CN114692620A (en) * | 2020-12-28 | 2022-07-01 | 阿里巴巴集团控股有限公司 | Text processing method and device |
CN114020880A (en) * | 2022-01-06 | 2022-02-08 | 杭州费尔斯通科技有限公司 | Method, system, electronic device and storage medium for extracting hypernym |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106569993A (en) | Method and device for mining hypernym-hyponym relation between domain-specific terms | |
CN107436864B (en) | Chinese question-answer semantic similarity calculation method based on Word2Vec | |
Li et al. | Recursive deep models for discourse parsing | |
Ptaszynski et al. | Language combinatorics: A sentence pattern extraction architecture based on combinatorial explosion | |
CN102799577B (en) | Method for extracting semantic relations between Chinese entities | |
CN105138514B (en) | Chinese word segmentation method based on a dictionary with forward incremental word-by-word maximum matching | |
CN107133212B (en) | Text implication recognition method based on integrated learning and word and sentence comprehensive information | |
CN106537370A (en) | Method and system for robust tagging of named entities in the presence of source or translation errors | |
CN112948543A (en) | Multi-language multi-document abstract extraction method based on weighted TextRank | |
CN103314369B (en) | Machine translation apparatus and method | |
Sahu et al. | Prashnottar: a Hindi question answering system | |
CN106528524A (en) | Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm | |
CN105808711A (en) | System and method for generating model based on semantic text concept | |
Dien et al. | POS-tagger for English-Vietnamese bilingual corpus | |
Othman et al. | Arabic text processing model: Verbs roots and conjugation automation | |
Sembok et al. | Arabic word stemming algorithms and retrieval effectiveness | |
Sethi et al. | Automated title generation in English language using NLP | |
Kessler et al. | Extraction of terminology in the field of construction | |
CN106126501B (en) | Noun word sense disambiguation method and device based on dependency constraints and knowledge | |
CN109002540B (en) | Method for automatically generating Chinese announcement document question answer pairs | |
CN106202033B (en) | Adverb word sense disambiguation method and device based on dependency constraints and knowledge | |
Ali et al. | A hybrid approach to Urdu verb phrase chunking | |
Outahajala et al. | Using confidence and informativeness criteria to improve POS-tagging in amazigh | |
Al-Arfaj et al. | Arabic NLP tools for ontology construction from Arabic text: An overview | |
Chawla et al. | Pre-trained affective word representations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170419 |