CN106569993A - Method and device for mining hypernym-hyponym relation between domain-specific terms - Google Patents
- Publication number: CN106569993A (application CN201510652163.2A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Machine Translation (AREA)
Abstract
The invention provides a method and device for mining the hypernym-hyponym relation between domain-specific terms. The method includes: acquiring, according to a plurality of first predetermined domain-specific terms, the entry explanation sentences on dictionary pages in which first domain-specific terms are located, the first domain-specific terms being words semantically related to the first predetermined domain-specific terms; and acquiring, by using a hypernym-hyponym relation model file generated in advance with a conditional random field (CRF) tool, the hypernym-hyponym relation between the first domain-specific terms and the words included in the entry explanation sentences. According to the scheme of the invention, the entry explanation sentences in which the first domain-specific terms are located on dictionary pages are used, training and learning are performed with CRF machine-learning techniques, a model file is finally established, and the hypernym-hyponym relation between the first domain-specific terms and the words included in the entry explanation sentences is acquired by using the model file, so that the accuracy of hypernym-hyponym relation acquisition can be improved.
Description
Technical field
The present invention relates to the technical field of data services, and in particular to a method and device for mining the hypernym-hyponym relation between domain-specific terms.
Background art
The hypernym-hyponym relation is a kind of semantic relation, usually used in the construction and refinement of dictionaries, ontologies and knowledge bases. In ontology learning the hypernym-hyponym relation is defined as follows: given two terms D and U, if the meaning expressed by U includes that of D, then U and D are considered to have a hypernym-hyponym relation, where U is the superordinate concept (hypernym) of D and D is the subordinate concept (hyponym) of U, denoted ISA(D, U). Examples are ISA(carbon dioxide, greenhouse gas) and ISA(4G business-travel package, tariff package). The hypernym-hyponym relation can be applied to query expansion in search engines or automatic question answering. For example, when a user searches for "4G business-travel package", the superordinate concept "tariff package" of the searched entity can be used to push more, broader related information to the user, enriching the user's search experience and strengthening user retention.
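For illustration only (this sketch is not part of the original disclosure; the relation store and the expansion logic are hypothetical), hypernym-based query expansion of the kind described above could look like:

```python
# Mined ISA(D, U) pairs stored as hyponym -> hypernym (hypothetical data).
hypernyms = {
    "carbon dioxide": "greenhouse gas",
    "4G business-travel package": "tariff package",
}

def expand_query(query):
    """Return the original query plus its hypernym, if one was mined."""
    expanded = [query]
    if query in hypernyms:
        expanded.append(hypernyms[query])
    return expanded

# The expanded query also retrieves pages about the broader concept.
print(expand_query("4G business-travel package"))
# ['4G business-travel package', 'tariff package']
```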
At present there has been considerable research abroad on acquiring hypernym-hyponym relations between concepts; the common approaches are template-based methods, dictionary-based methods, statistics-based methods, and hybrids of these. Research specifically targeting the acquisition of hypernym-hyponym relations between concepts in Chinese text, however, is still scarce, and mainly uses the template-based and dictionary-based methods.
The basic idea of the template-based method is to use text-analysis techniques to summarize, from domain text, some frequently occurring language patterns as rules, and then use pattern matching to judge whether a word sequence in the text matches a certain pattern, thereby identifying the corresponding hypernym-hyponym relation. This method can extract knowledge effectively with little supervision from professionals and little predefined knowledge, while also avoiding the use of complex natural-language-understanding models to process the relations between concepts.
The dictionary-based method typically obtains the relations between ontology concepts from knowledge such as the synonyms, near-synonyms and antonyms defined in existing manually constructed lexicons. For English, for example, WordNet, an English dictionary based on cognitive linguistics, is used to obtain classification relations between concepts. For Chinese, for example, Dong Zhendong compiled a general-domain dictionary, HowNet, which describes the relations between words with a sememe tree: for each disambiguated term, taking the semantics determined after disambiguation as the center and the exploration depth specified by the user as the radius, a circle is drawn, and the concepts contained in the concept sets within the circle are found. Finally these relations are added to the relation set of the ontology.
However, most existing methods for mining hypernym-hyponym relations between ontology concepts simply rely on the prior knowledge of domain experts and on rule matching, and therefore have the following problems:
First, the template-based method: because of the complexity of the Chinese language and the lack of semantic analysis during pattern matching, a large number of useless concepts are also matched, which greatly reduces the accuracy of the extraction results; furthermore, because Chinese syntactic forms are highly diverse, it is difficult to construct a relatively complete pattern set.
Second, the dictionary-based method: the cost of constructing, maintaining and updating a dictionary is prohibitive, and many specialized domain terms are hard to find in a general-domain lexicon; moreover, when the method is applied to different domains, the explanation of a specific word does not necessarily meet the requirements of the specific domain, a difficulty faced by all dictionary-based methods.
Therefore, the hypernym-hyponym relation acquisition methods currently in use all suffer from relatively low precision and recall; that is, the particularity and complexity of real language text make the mining of hypernym-hyponym relations very challenging.
Content of the invention
To overcome the above problems in the prior art, the embodiments of the present invention provide a method and device for mining the hypernym-hyponym relation between domain-specific terms, which combine the entry explanation sentences on dictionary pages with a conditional random field (CRF) tool, so that the hypernym-hyponym relations acquired between domain terms are more accurate.
To solve the above technical problem, the present invention adopts the following technical solutions:
According to one aspect of the embodiments of the present invention, a method for mining the hypernym-hyponym relation between domain-specific terms is provided, including:
acquiring, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located, the first domain terms being words semantically related to the first predetermined domain terms; and
acquiring, by using a hypernym-hyponym relation model file generated in advance with a CRF tool, the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences.
In the above scheme, before the step of acquiring, according to the plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which the first domain terms are located, the method further includes:
generating the hypernym-hyponym relation model file by using the CRF tool.
In the above scheme, the step of generating the hypernym-hyponym relation model file by using the CRF tool includes:
acquiring, according to a plurality of second predetermined domain terms, the entry explanation sentences on dictionary pages in which second domain terms are located, and saving them as training sentences, the second domain terms being words semantically related to the second predetermined domain terms;
generating a training file from the training sentences according to a preset CRF training file format; and
generating the hypernym-hyponym relation model file with the CRF tool according to the training file and a predetermined template file.
In the above scheme, the step of generating the training file from the training sentences according to the preset CRF training file format includes:
performing word segmentation on the training sentences and carrying out part-of-speech tagging to obtain first tagged sentences; and
generating a training file to be tagged from the first tagged sentences according to the preset CRF training file format, tagging the features in the file to be tagged other than the word feature and the part-of-speech feature, and thereby generating the training file.
In the above scheme, the step of generating the training file to be tagged from the first tagged sentences according to the preset CRF training file format, tagging the features in the file to be tagged other than the word feature and the part-of-speech feature, and generating the training file includes:
generating the training file to be tagged from the first tagged sentences according to the preset CRF training file format, wherein the training file to be tagged includes a plurality of tokens located on different rows, each token including a plurality of feature columns and a result output token column, the plurality of feature columns respectively representing different features, wherein the features include a word feature, a part-of-speech feature, a feature-word dictionary feature and a colon-information feature, the word feature column corresponding to one word or one punctuation mark included in the first tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
extracting feature words from the first tagged sentences and storing them in a feature-word dictionary;
judging, according to the feature-word dictionary, whether the content of the word feature column of each token in the training file to be tagged is a feature word, and if so, marking 1 in the feature-word dictionary feature column, otherwise marking 0;
judging whether the content of the word feature column of each token in the training file to be tagged is a colon, and if so, marking 1 in the colon-information feature column, otherwise marking 0; and
judging whether the contents of the word feature columns of a plurality of tokens in the training file to be tagged are words having a hypernym-hyponym relation, and if so, marking U in the result output token column corresponding to the word that is the superordinate concept and marking D in the result output token column corresponding to the word that is the subordinate concept; if not, marking 0 in the result output token column of the token.
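For illustration only (not part of the original disclosure; the sample tokens and the stand-in feature-word dictionary are made up), the five-column token layout described above, with U/D/0 output labels, can be sketched as:

```python
FEATURE_WORDS = {"refer", "kind"}    # stand-in feature-word dictionary

def make_token_rows(tagged_words, labels):
    """Build tab-separated five-column rows for a CRF training file:
    word, part of speech, feature-word flag, colon flag, output label."""
    rows = []
    for (word, pos), label in zip(tagged_words, labels):
        dict_flag = "1" if word in FEATURE_WORDS else "0"   # feature-word dictionary feature
        colon_flag = "1" if word == ":" else "0"            # colon-information feature
        rows.append("\t".join([word, pos, dict_flag, colon_flag, label]))
    return rows

tagged = [("El Nino", "n"), ("refer", "v"), ("extreme climate phenomenon", "n")]
labels = ["D", "0", "U"]   # hyponym, neither, hypernym
for row in make_token_rows(tagged, labels):
    print(row)
```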
In the above scheme, the step of extracting feature words from the first tagged sentences and storing them in the feature-word dictionary includes:
filtering out the start and stop tag words, nouns, adjectives, English words, time words, Arabic numerals and punctuation marks in the first tagged sentences, and taking the remaining words in the first tagged sentences as candidate feature words;
counting the term frequency of each candidate feature word, the term frequency being the number of times the candidate feature word occurs in the first tagged sentences; and
screening out feature words, according to the semantic function of the candidate feature words, from the candidate feature words whose term frequency exceeds a predetermined threshold, and storing them in the feature-word dictionary.
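A minimal sketch of the candidate-feature-word step (illustrative, not from the patent; the filtered tag names are assumptions loosely based on the tag set explained later in the document, and the final screening by semantic function is a manual step not shown):

```python
from collections import Counter

# Tags assumed filtered out: nouns (n), adjectives (a), punctuation (w),
# numerals (m), English (eng), time words (t). Illustrative only.
FILTERED_POS = {"n", "a", "w", "m", "eng", "t"}

def candidate_feature_words(tagged_sentences, threshold):
    """Count non-filtered words; keep those whose frequency exceeds threshold."""
    counts = Counter(
        word
        for sentence in tagged_sentences
        for word, pos in sentence
        if pos not in FILTERED_POS
    )
    return {word for word, count in counts.items() if count > threshold}

sentences = [[("refers", "v"), ("to", "p"), ("energy", "n")],
             [("refers", "v"), ("is", "v")]]
print(candidate_feature_words(sentences, 1))   # {'refers'}
```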
In the above scheme, the step of generating the hypernym-hyponym relation model file with the CRF tool according to the training file and the template file is:
inputting, according to a plurality of feature templates included in the template file, the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column included in each token of the training file into the CRF tool, and outputting the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance being the number of unit rows to slide forward or backward in the training file, taking the current feature as the reference, and the column position information indicating one of the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column.
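The "sliding distance plus column position" mechanism described here corresponds to the template syntax of common CRF toolkits such as CRF++, where `%x[row,col]` selects the feature `row` rows before or after the current token in column `col`. A hypothetical template fragment for the five-column training file might read as follows (the template names and the window size are assumptions, not taken from the patent):

```
# word feature (column 0) of the previous, current and next token
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
# part-of-speech feature (column 1) of the current token
U03:%x[0,1]
# feature-word dictionary flag (column 2) and colon flag (column 3)
U04:%x[0,2]
U05:%x[0,3]
# bigram over adjacent output labels
B
```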
In the above scheme, the step of acquiring, according to the plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which the first domain terms are located includes:
crawling, with a web crawler, the dictionary pages related to the first predetermined domain terms, and acquiring the entry explanation sentences in which the first predetermined domain terms are located on the dictionary pages;
acquiring, from the entry explanation sentences in which the first predetermined domain terms are located, the domain terms that carry hyperlinks; and
crawling, with the web crawler, the dictionary pages related to the domain terms that carry hyperlinks, and acquiring on those dictionary pages the entry explanation sentences in which the words semantically related to the domain terms that carry hyperlinks are located, until the preset crawl depth of the web crawler is reached.
In the above scheme, the step of acquiring, by using the hypernym-hyponym relation model file generated in advance with the CRF tool, the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences includes:
performing word segmentation on the entry explanation sentences and carrying out part-of-speech tagging to obtain second tagged sentences;
generating a test file from the second tagged sentences according to a preset CRF test file format, the test file including a plurality of tokens, each token including two columns that respectively represent the word feature and the part-of-speech feature, the word feature column corresponding to one word or one punctuation mark included in the second tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column; and
testing the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
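The two-column test file can be sketched like this (illustrative only; the sample sentence is invented, and the separator and blank-line conventions follow the training-file format described in the embodiment below):

```python
def make_test_file(tagged_words):
    """Write word and part-of-speech columns; the model predicts the labels."""
    lines = ["\t".join((word, pos)) for word, pos in tagged_words]
    return "\n".join(lines) + "\n\n"   # blank line ends the sentence

tagged = [("El Nino", "n"), ("is", "v"), ("extreme climate phenomenon", "n")]
print(make_test_file(tagged), end="")
```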
According to another aspect of the embodiments of the present invention, a device for mining the hypernym-hyponym relation between domain-specific terms is further provided, including:
an acquisition module, configured to acquire, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located, the first domain terms being words semantically related to the first predetermined domain terms; and
a relation acquisition module, configured to acquire, by using the hypernym-hyponym relation model file generated in advance with the CRF tool, the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences.
In the above scheme, the device further includes:
a model building module, configured to generate the hypernym-hyponym relation model file by using the CRF tool.
In the above scheme, the model building module includes:
an acquisition unit, configured to acquire, according to a plurality of second predetermined domain terms, the entry explanation sentences on dictionary pages in which second domain terms are located, and to save them as training sentences, the second domain terms being words semantically related to the second predetermined domain terms;
a file generating unit, configured to generate a training file from the training sentences according to a preset CRF training file format; and
a relation acquiring unit, configured to generate the hypernym-hyponym relation model file with the CRF tool according to the training file and a predetermined template file.
In the above scheme, the file generating unit includes:
a first tagging subunit, configured to perform word segmentation on the training sentences and carry out part-of-speech tagging to obtain first tagged sentences; and
a second tagging subunit, configured to generate a training file to be tagged from the first tagged sentences according to the preset CRF training file format, to tag the features in the file to be tagged other than the word feature and the part-of-speech feature, and to generate the training file.
In the above scheme, the second tagging subunit is specifically configured to:
generate the training file to be tagged from the first tagged sentences according to the preset CRF training file format, wherein the training file to be tagged includes a plurality of tokens located on different rows, each token including a plurality of feature columns and a result output token column, the plurality of feature columns respectively representing different features, wherein the features include a word feature, a part-of-speech feature, a feature-word dictionary feature and a colon-information feature, the word feature column corresponding to one word or one punctuation mark included in the first tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
extract feature words from the first tagged sentences and store them in a feature-word dictionary;
judge, according to the feature-word dictionary, whether the content of the word feature column of each token in the training file to be tagged is a feature word, and if so, mark 1 in the feature-word dictionary feature column, otherwise mark 0;
judge whether the content of the word feature column of each token in the training file to be tagged is a colon, and if so, mark 1 in the colon-information feature column, otherwise mark 0; and
judge whether the contents of the word feature columns of a plurality of tokens in the training file to be tagged are words having a hypernym-hyponym relation, and if so, mark U in the result output token column corresponding to the word that is the superordinate concept and mark D in the result output token column corresponding to the word that is the subordinate concept; if not, mark 0 in the result output token column of the token.
In the above scheme, the feature words are obtained by filtering out the start and stop tag words, nouns, adjectives, English words, time words, Arabic numerals and punctuation marks in the first tagged sentences, taking the remaining words in the first tagged sentences as candidate feature words, then counting the term frequency of each candidate feature word, and screening the feature words, according to the semantic function of the candidate feature words, from the candidate feature words whose term frequency exceeds a predetermined threshold, wherein the term frequency is the number of times the candidate feature word occurs in the first tagged sentences.
In the above scheme, the relation acquiring unit is specifically configured to:
input, according to a plurality of feature templates included in the template file, the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column included in each token of the training file into the CRF tool, and output the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance being the number of unit rows to slide forward or backward in the training file, taking the current feature as the reference, and the column position information indicating one of the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column.
In the above scheme, the acquisition module includes:
a first crawling unit, configured to crawl, with a web crawler, the dictionary pages related to the first predetermined domain terms, and to acquire the entry explanation sentences in which the first predetermined domain terms are located on the dictionary pages;
a first acquiring unit, configured to acquire, from the entry explanation sentences in which the first predetermined domain terms are located, the domain terms that carry hyperlinks; and
a second crawling unit, configured to crawl, with the web crawler, the dictionary pages related to the domain terms that carry hyperlinks, and to acquire on those dictionary pages the entry explanation sentences in which the words semantically related to the domain terms that carry hyperlinks are located, until the preset crawl depth of the web crawler is reached.
In the above scheme, the relation acquisition module includes:
a word segmentation unit, configured to perform word segmentation on the entry explanation sentences and carry out part-of-speech tagging to obtain second tagged sentences;
a format conversion unit, configured to generate a test file from the second tagged sentences according to a preset CRF test file format, the test file including a plurality of tokens, each token including two columns that respectively represent the word feature and the part-of-speech feature, the word feature column corresponding to one word or one punctuation mark included in the second tagged sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column; and
a test unit, configured to test the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
The beneficial effects of the embodiments of the present invention are as follows:
The method for mining the hypernym-hyponym relation between domain-specific terms of the embodiments of the present invention exploits the characteristic that the entry explanation sentence of a domain term on a dictionary page usually contains hypernym-hyponym relations between domain terms. Taking the entry explanation sentences in which the domain terms acquired from dictionary pages are located as the basis, and combining them with the hypernym-hyponym relation model file generated by the CRF tool, the method acquires the hypernym-hyponym relations between domain terms, breaks the pattern of single features and simple algorithms found in existing methods for mining hypernym-hyponym relations between ontology domain terms, and improves the accuracy of hypernym-hyponym relation acquisition.
Description of the drawings
Fig. 1 shows a flowchart of the method for mining the hypernym-hyponym relation between domain-specific terms of an embodiment of the present invention;
Fig. 2 shows a structural block diagram of the device for mining the hypernym-hyponym relation between domain-specific terms of an embodiment of the present invention;
Fig. 3 shows a structural block diagram of the model building module of an embodiment of the present invention;
Fig. 4 shows a structural block diagram of the file generating unit of an embodiment of the present invention;
Fig. 5 shows a structural block diagram of the acquisition module of an embodiment of the present invention;
Fig. 6 shows a structural block diagram of the relation acquisition module of an embodiment of the present invention;
Fig. 7 shows a schematic diagram of the application principle of the device for mining the hypernym-hyponym relation between domain-specific terms of an embodiment of the present invention applied in a search engine.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the accompanying drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided so that the present disclosure can be understood more thoroughly, and so that the scope of the present disclosure can be fully conveyed to those skilled in the art.
Embodiment one
According to one aspect of the embodiments of the present invention, a method for mining the hypernym-hyponym relation between domain-specific terms is provided. The method first acquires, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located, the first domain terms being words semantically related to the first predetermined domain terms; then, by using a hypernym-hyponym relation model file generated in advance with a CRF tool, it acquires the hypernym-hyponym relation between the first domain terms and the words included in the entry explanation sentences.
Thus, the method of the embodiments of the present invention uses the entry explanation sentences of the domain terms on dictionary pages, performs training and learning with CRF machine-learning techniques, finally establishes the model file, and uses the model file to acquire the hypernym-hyponym relation between the domain terms and the words included in the entry explanation sentences, which improves the accuracy of hypernym-hyponym relation acquisition.
As shown in Fig. 1, the method includes:
Step S11: acquiring, according to a plurality of first predetermined domain terms, the entry explanation sentences on dictionary pages in which first domain terms are located.
The dictionary pages include Baidu Baike, Wikipedia and the like. The encyclopedia card part of a dictionary page is typically an explanation or description of the domain term associated with that page, and these explanations usually contain hypernym-hyponym relations between domain terms. For example, a Baidu Baike page shows: "El Niño, also known as the Christ Child phenomenon, is a noun used by the fishermen along the coast of Peru and Ecuador to refer to a kind of extreme climate phenomenon." This sentence contains the hypernym-hyponym relation between "El Niño" and "extreme climate phenomenon". The regularized way in which encyclopedia cards express such relations is therefore very helpful for mining hypernym-hyponym relations, and the method of the embodiments of the present invention takes the encyclopedia card information as the target corpus, that is, as the basis for mining hypernym-hyponym relations between domain terms.
The first predetermined domain terms are selected in advance according to the domain terms for which the hypernym-hyponym relations need to be determined. For example, five domain terms A1, A2, A3, A4 and A5 are selected in advance as the first predetermined domain terms, and the entry explanation sentences in which the words semantically related to A1~A5 are located are acquired on the Baidu Baike pages, wherein the words semantically related to A1~A5 are exactly the first domain terms.
Specifically, step S11 includes:
crawling, with a web crawler, the dictionary pages related to the first predetermined domain terms, and acquiring the entry explanation sentences in which the first predetermined domain terms are located on the dictionary pages;
acquiring, from the entry explanation sentences in which the first predetermined domain terms are located, the domain terms that carry hyperlinks; and
crawling, with the web crawler, the dictionary pages related to the domain terms that carry hyperlinks, and acquiring on those dictionary pages the entry explanation sentences in which the words semantically related to the domain terms that carry hyperlinks are located, until the preset crawl depth of the web crawler is reached.
For example, the Baidu Baike pages related to the domain terms A1~A5 are first crawled by the web crawler, and the entry explanation sentences in which the words semantically related to A1~A5 (for example, B1~B5) are located are acquired; then, the words that carry hyperlinks (for example, C1~C5) are acquired from the acquired entry explanation sentences; next, the Baidu Baike pages related to C1~C5 are further crawled by the web crawler, and the entry explanation sentences in which the words semantically related to C1~C5 are located are acquired on those pages, and so on, until the preset crawl depth of the web crawler is reached.
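The depth-limited crawl just described can be sketched as follows (illustrative only: `PAGES` stands in for the live encyclopedia, and the term names follow the A1/B1/C1 example in the text):

```python
from collections import deque

PAGES = {   # term -> (entry explanation sentence, hyperlinked terms in it)
    "A1": ("A1 is a kind of B1", ["B1"]),
    "B1": ("B1 is a kind of C1", ["C1"]),
    "C1": ("C1 is a kind of D1", ["D1"]),
    "D1": ("D1 is a kind of E1", ["E1"]),
}

def crawl(seed_terms, max_depth):
    """Breadth-first collection of explanation sentences up to max_depth."""
    sentences, seen = [], set()
    queue = deque((term, 0) for term in seed_terms)
    while queue:
        term, depth = queue.popleft()
        if term in seen or term not in PAGES or depth > max_depth:
            continue
        seen.add(term)
        sentence, links = PAGES[term]
        sentences.append(sentence)
        for link in links:            # follow hyperlinked domain terms
            queue.append((link, depth + 1))
    return sentences

print(crawl(["A1"], 2))   # D1's page lies beyond the preset crawl depth
```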
Before step S11, the method further includes:
generating the hypernym-hyponym relation model file by using the CRF tool.
Generating a model file with the CRF tool requires a training file and a template file. The training file requires a corpus, and in the method of the embodiments of the present invention the entry explanation sentences in which the domain terms acquired from the dictionary pages are located are used as the corpus.
Accordingly, the step of generating the hypernym-hyponym relation model file by using the CRF tool includes:
acquiring, according to a plurality of second predetermined domain terms, the entry explanation sentences on dictionary pages in which second domain terms are located, and saving them as training sentences, the second domain terms being words semantically related to the second predetermined domain terms;
generating a training file from the training sentences according to a preset CRF training file format; and
generating the hypernym-hyponym relation model file with the CRF tool according to the training file and a predetermined template file.
The second predetermined domain terms and the first predetermined domain terms described above are different domain terms. For example, suppose a domain term set includes the hundred words X1~X100. Eighty words can be selected from this set as the second predetermined domain terms, and according to these eighty words the entry explanation sentences in which the words semantically related to them are located are acquired on the dictionary pages, the acquired sentences being used as training sentences. The remaining twenty words in the set are used as the first predetermined domain terms, the entry explanation sentences related to these twenty words are acquired on the dictionary pages, and the acquired sentences are used as test sentences. Of course, the acquisition of the training sentences and the acquisition of the test sentences can be carried out simultaneously, or separately in the same manner.
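The 80/20 split described above can be sketched as follows (the term names X1~X100 are the placeholders used in the text; the fixed random seed is an illustrative choice, not from the patent):

```python
import random

terms = [f"X{i}" for i in range(1, 101)]   # the domain term set X1~X100

rng = random.Random(0)   # fixed seed so the split is reproducible
rng.shuffle(terms)
train_terms = terms[:80]   # second predetermined domain terms (training)
test_terms = terms[80:]    # first predetermined domain terms (testing)

print(len(train_terms), len(test_terms))   # 80 20
```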
The step of generating the training file from the training sentences according to the preset CRF training file format specifically includes:
First, word segmentation is performed on the training sentences and part-of-speech tagging is carried out to obtain first tagged sentences.
For example, a certain training sentence is: "Renewable resources refer to energy sources in nature that can be constantly regenerated and used sustainably, and that are characterized by being inexhaustible." After word segmentation and part-of-speech tagging, the tagged result is: "renewable resources/n are/v defined as/v in/p nature/n within/f can/v constantly/d regenerate/v ,/w sustainably/d use/v of/uj energy/n ,/w have/v inexhaustible/i ,/w never-ending/i of/uj characteristics/n ./w", wherein the tag after each slash represents the part of speech of the content immediately before that slash. Specifically, v represents a verb, p a preposition, n a noun, f a locative noun, d an adverb, w a punctuation mark, uj a structural auxiliary word, and i an idiom.
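For reference, the tag set explained above can be captured in a small lookup (the tags and their meanings are as given in the text; the helper name is illustrative):

```python
POS_TAGS = {
    "v": "verb",
    "p": "preposition",
    "n": "noun",
    "f": "locative noun",
    "d": "adverb",
    "w": "punctuation mark",
    "uj": "structural auxiliary word",
    "i": "idiom",
}

def explain_token(token):
    """Split a 'word/tag' token and spell out its part of speech."""
    word, _, tag = token.rpartition("/")
    return f"{word}: {POS_TAGS.get(tag, 'unknown tag')}"

print(explain_token("energy/n"))        # energy: noun
print(explain_token("constantly/d"))    # constantly: adverb
```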
Then, the training file to be tagged is generated from the first tagged sentences according to the preset CRF training file format; the training file to be tagged includes a plurality of tokens located on different rows, each token including a plurality of feature columns and a result output token column, the plurality of feature columns respectively representing different features. In an embodiment of the present invention, each token included in the training file includes five columns, which respectively represent the word feature, the part-of-speech feature, the feature-word dictionary feature, the colon-information feature and the result output token column. The first tagged sentences obtained in the previous step are already tagged with parts of speech; therefore, each word and punctuation mark in a first tagged sentence can be written into the word feature column of a different token, and the part of speech of the corresponding word or punctuation mark can be written into the corresponding part-of-speech feature column. In the training file, the columns included in each token are separated by spaces or tab characters, a sequence of tokens constitutes one sentence, and sentences are separated by blank lines. Finally, the features not yet tagged in the training file to be tagged need to be tagged: the feature-word dictionary feature and the colon-information feature are tagged, and the hypernym-hyponym relation is tagged on the result output token column, thereby obtaining the training file. The result after finally tagging the training file to be tagged is shown in Table 1.
The specific annotation process is as follows:
Feature words are extracted from the first tagged sentences and stored in a feature-word dictionary.
According to the feature-word dictionary, it is judged whether the content of the word feature column of each token in the to-be-annotated training file is a feature word; if so, 1 is marked in the feature-word dictionary feature column, otherwise 0 is marked.
It is judged whether the content of the word feature column of each token in the to-be-annotated training file is a colon; if so, 1 is marked in the pointing-information feature column, otherwise 0 is marked.
It is judged whether the contents of the word feature columns of a plurality of tokens in the to-be-annotated training file are words having a hypernym-hyponym relation; if so, the result output label column of the word that is the superordinate concept is marked U and the result output label column of the word that is the subordinate concept is marked D; if not, the result output label column of the token is marked 0.
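The five-column row construction can be sketched as below. This is an illustrative sketch, not the patent's implementation: the U/D decisions would come from the human annotator and are passed in as sets here, and the field-term handling (rewriting the POS column to S) anticipates the remark that follows.

```python
# Sketch of building the five-column token rows of the training file.
# Inputs: (word, pos) pairs, a feature-word dictionary, the sets of words
# the annotator marked as superordinate (U) / subordinate (D) concepts,
# and the set of field terms (whose POS column is rewritten to S).
# All data and names here are illustrative.

def build_rows(tagged, feature_dict, hypernyms, hyponyms, field_terms):
    rows = []
    for word, pos in tagged:
        pos_col = "S" if word in field_terms else pos
        dict_col = 1 if word in feature_dict else 0
        colon_col = 1 if word == ":" else 0   # pointing-information feature
        if word in hypernyms:
            label = "U"
        elif word in hyponyms:
            label = "D"
        else:
            label = "0"
        rows.append((word, pos_col, dict_col, colon_col, label))
    return rows

rows = build_rows(
    tagged=[("Renewable resource", "n"), ("is", "v"), ("energy", "n")],
    feature_dict={"is", "refers to"},
    hypernyms={"energy"}, hyponyms={"Renewable resource"},
    field_terms={"Renewable resource", "energy"})
for r in rows:
    print("\t".join(str(c) for c in r))
```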
In addition, in order to see more clearly which word feature columns in the training file contain field terms, the mark in the part-of-speech feature column of any word that is a field term may be changed to S, which facilitates locating, from the marks in the result output label column, the hypernym-hyponym relations between field terms.
Table 1. Final annotation example of the training file
Word feature | Part-of-speech feature | Feature-word dictionary feature | Pointing-information feature | Result output label column |
Renewable resource | S | 0 | 0 | D |
It is | v | 1 | 0 | 0 |
Refer to | v | 1 | 0 | 0 |
In | p | 1 | 0 | 0 |
Nature | n | 0 | 0 | 0 |
In | f | 1 | 0 | 0 |
Can be with | v | 1 | 0 | 0 |
Constantly | d | 0 | 0 | 0 |
Regeneration | v | 0 | 0 | 0 |
、 | w | 0 | 0 | 0 |
Continue forever | d | 0 | 0 | 0 |
Utilize | v | 0 | 0 | 0 |
's | uj | 1 | 0 | 0 |
The energy | n | 0 | 0 | U |
, | w | 0 | 0 | 0 |
Have | v | 1 | 0 | 0 |
Inexhaustible | i | 0 | 0 | 0 |
, | w | 0 | 0 | 0 |
Unfailing | i | 0 | 0 | 0 |
's | uj | 1 | 0 | 0 |
Feature | n | 0 | 0 | 0 |
。 | w | 0 | 0 | 0 |
As can be seen from the above, the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention can, in a certain sense, be regarded as a relation classification process: the relation between any two words in a given text is classified, according to the object and target of study, as either a hypernym-hyponym relation or a non-hypernym-hyponym relation. The target relation mining problem is thereby converted into a relation classification problem, which simplifies the problem.
In addition, statistics show that the linguistic organization of dictionary pages, such as encyclopedia summary cards, contains a number of fixed, regular patterns, and the words constituting these patterns are strongly indicative and instructive for the mining of hypernym-hyponym relations. In the noun explanations of encyclopedia entries, "feature words" with an evident referring function, such as "is a kind of" or "is a", often appear near field terms having a hypernym-hyponym relation. Summarization shows that these feature words occur frequently in the explanations of encyclopedia entries and indicate a certain semantic relation; their meaning is abstract, so they are usually not content words such as nouns or adjectives, but verbs, adverbs, prepositions and other words that express abstract concepts unrelated to the field.
Therefore, the above-mentioned feature words can be obtained by filtering out the start/stop marker words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first tagged sentences, and taking the remaining words in the first tagged sentences as candidate feature words. Then, the word frequency of each candidate feature word is counted, and the feature words are selected, according to the semantic function of the candidate feature words, from the candidate feature words whose word frequency exceeds a predetermined threshold, wherein the word frequency is the number of times a candidate feature word occurs in the first tagged sentences.
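The candidate feature word selection can be sketched as below. The POS codes to filter and the threshold value are illustrative assumptions; the patent specifies the word classes to filter but not the concrete tag codes or threshold.

```python
from collections import Counter

# Sketch of selecting candidate feature words: drop tokens whose POS marks
# them as filtered classes (noun n, adjective a, English eng, time t,
# numeral m, punctuation w -- illustrative tag codes), keep the rest as
# candidates, then keep those whose frequency across all first tagged
# sentences exceeds a threshold.

FILTERED_POS = {"n", "a", "eng", "t", "m", "w"}

def candidate_feature_words(tagged_sentences, threshold):
    counts = Counter(word
                     for sent in tagged_sentences
                     for word, pos in sent
                     if pos not in FILTERED_POS)
    return {w for w, c in counts.items() if c > threshold}

sents = [[("is", "v"), ("a kind of", "v"), ("energy", "n")],
         [("is", "v"), ("a kind of", "v"), ("resource", "n")],
         [("is", "v"), ("water", "n")]]
print(candidate_feature_words(sents, threshold=1))
# -> {'is', 'a kind of'} (order may vary)
```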
In addition, for the template file, atomic features and combined features may be selected in advance, feature templates determined according to the atomic features and combined features, and a plurality of feature templates assembled into the template file. A suitable feature set is thus selected for the generation of the model file, so that complex linguistic phenomena are represented by simple features and their underlying regularities are captured.
Statistics also show that punctuation marks help express semantic information: if two terms in a sentence are separated by a colon, there is a very high probability that a hypernym-hyponym relation exists between them. Therefore, in the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention, the pointing information is used, together with the word, the part of speech and the feature-word dictionary, as the atomic features of the model. Meanwhile, in order to fully consider the influence of context, combined features are introduced through a sliding window whose size is set, for example, to two units; that is, taking the current term as the reference, the window slides two unit distances forward and backward, where a unit distance here refers to a row in the training file.
The feature templates may be as shown in Table 2. As shown in Table 2, the basic format of a feature template is %x[row, col], where row specifies the row number relative to the current token, i.e. the size of the sliding distance, and col specifies the absolute column number, corresponding to a feature in the input data sequence. Because both atomic features and combined features are added to the feature templates, the context can be fully taken into account and the limitation of the independence assumption is escaped, so that a better tagging effect can be obtained; the customization of the feature templates, however, requires careful tuning.
Table 2. Feature templates
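A template file in the %x[row, col] format can be generated mechanically. The sketch below assumes a CRF++-style unigram template syntax, a window of two rows in each direction, and four feature columns (0 word, 1 part of speech, 2 feature-word dictionary, 3 pointing information); the exact template set used in the embodiment is not reproduced in the text.

```python
# Sketch: emit unigram feature templates in the CRF++-style %x[row,col]
# notation for a sliding window of +/-2 rows over 4 feature columns
# (0 word, 1 POS, 2 feature-word dictionary, 3 pointing information).
# The window size and column layout follow the description in the text;
# the template naming scheme (U00, U01, ...) is an assumption.

def make_templates(window=2, columns=4):
    templates = []
    i = 0
    for col in range(columns):
        for row in range(-window, window + 1):
            templates.append(f"U{i:02d}:%x[{row},{col}]")
            i += 1
    return templates

tpl = make_templates()
print(len(tpl))          # 4 columns * 5 window positions = 20 templates
print(tpl[0], tpl[-1])   # U00:%x[-2,0] U19:%x[2,3]
```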
After the training file and the template file are finally obtained, the word feature, part-of-speech feature, feature-word dictionary feature, pointing-information feature and result output label column of each token in the training file are further input into the CRF tool according to the plurality of feature templates included in the template file, and the hypernym-hyponym relation model file is output from the CRF tool.
Each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool. The sliding distance is the number of unit rows slid forward or backward in the training file, taking the current feature as the reference; the column position information covers the word feature, the part-of-speech feature, the feature-word dictionary feature, the pointing-information feature and the result output label column. That is, a feature template expresses, within the training file, the coordinate position of a feature of another token relative to a feature of the current token taken as the origin.
In summary, in the CRF tool, the training process that generates the hypernym-hyponym relation model file can be summarized simply in terms of input and output. The input is the word-segmented and annotated corpus; the output is the hypernym-hyponym relation model file, which consists of a model composed of a set of feature functions together with a set of parameters. The trained hypernym-hyponym relation model file is then used to make predictions on the collected data, outputting a set of terms with hypernym-hyponym relations.
In step S13, the hypernym-hyponym relation model file generated in advance by the CRF tool is used to obtain the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
Specifically, step S13 comprises:
performing word segmentation and part-of-speech tagging on the entry explanation sentences to obtain second tagged sentences;
generating a test file from the second tagged sentences according to a predetermined CRF test file format, wherein the test file comprises a plurality of tokens, each token comprises two columns which respectively represent a word feature and a part-of-speech feature, the word feature column corresponds to a word or punctuation mark included in the second tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
testing the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
Therefore, after the collection of the entry explanation sentences in which the first field terms are located is completed in step S11, these sentences further need to be word-segmented and part-of-speech tagged to generate the test file, and the test file is then tested by the hypernym-hyponym relation model file, so as to obtain the required hypernym-hyponym relations.
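Reading the hypernym-hyponym pairs back out of the model's predicted label sequence can be sketched as below. The pairing policy (combining every D-labelled word with every U-labelled word of the same sentence) is an illustrative assumption; the patent only states that U marks the superordinate concept and D the subordinate one.

```python
# Sketch: recover (hyponym, hypernym) pairs from the label sequence that
# the model predicts for one test sentence. Pairing every D-labelled word
# with every U-labelled word in the sentence is an illustrative policy.

def extract_pairs(labelled_tokens):
    hypernyms = [w for w, lab in labelled_tokens if lab == "U"]
    hyponyms = [w for w, lab in labelled_tokens if lab == "D"]
    return [(hypo, hyper) for hypo in hyponyms for hyper in hypernyms]

sentence = [("Renewable resource", "D"), ("is", "0"), ("a kind of", "0"),
            ("energy", "U")]
print(extract_pairs(sentence))
# -> [('Renewable resource', 'energy')]
```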
In summary, the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention performs relation mining on field terms with a semi-supervised machine learning method, which greatly saves labor cost compared with existing dictionary-based and rule-based methods. The method of the embodiment of the present invention no longer relies entirely on the experience and knowledge of domain experts, but obtains the hypernym-hyponym relations between field terms by machine learning over unstructured web page data. It breaks the pattern of single features and simple algorithms in existing methods for mining hypernym-hyponym relations between field terms in ontology construction, and improves the accuracy of hypernym-hyponym relation acquisition.
In addition, the hypernym-hyponym relations between field terms obtained by the method of the embodiment of the present invention can be applied in conventional search engines or question answering systems. Conventional search engines and question answering systems generally use keyword-based matching techniques and do not make full use of the association relations between search entities, which leads to few hits or empty result sets. With the method for mining hypernym-hyponym relations between field terms of the embodiment of the present invention, query expansion based on hypernym-hyponym relations can be performed during a user's search or question answering process, which on the one hand increases the recall rate of the system, and on the other hand proactively recommends more and broader related information to the user, enriching the user experience and enhancing user stickiness.
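The query expansion step can be sketched as below, under the assumption that the mined relation set is stored as a mapping from each hypernym to its hyponyms; the data are illustrative.

```python
# Sketch: query expansion with mined hypernym-hyponym pairs. The mined
# relation set maps each hypernym to its hyponyms; expanding a query adds
# the hyponyms of any query term, increasing recall. Data are illustrative.

def expand_query(query_terms, hyponyms_of):
    expanded = list(query_terms)
    for term in query_terms:
        for hypo in hyponyms_of.get(term, ()):
            if hypo not in expanded:
                expanded.append(hypo)
    return expanded

relations = {"energy": ["renewable resource", "fossil fuel"]}
print(expand_query(["energy"], relations))
# -> ['energy', 'renewable resource', 'fossil fuel']
```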
Embodiment two
According to another aspect of the embodiments of the present invention, a device for mining hypernym-hyponym relations between field terms is further provided. As shown in Fig. 2, the device 200 comprises:
an acquisition module 201, configured to gather, according to a plurality of first predetermined field terms, the entry explanation sentences in which first field terms are located on the dictionary pages, the first field terms being words semantically related to the first predetermined field terms;
a relation acquisition module 205, configured to obtain, by using a hypernym-hyponym relation model file generated in advance by the CRF tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
Optionally, as shown in Fig. 2, the device further comprises:
a model building module 203, configured to generate the hypernym-hyponym relation model file by using the CRF tool.
Optionally, as shown in Fig. 3, the model building module 203 comprises:
a collecting unit 2031, configured to gather, according to a plurality of second predetermined field terms, the entry explanation sentences in which second field terms are located on the dictionary pages, and save them as training sentences, the second field terms being words semantically related to the second predetermined field terms;
a file generating unit 2032, configured to generate a training file from the training sentences according to a predetermined CRF training file format;
a relation acquisition unit 2033, configured to generate the hypernym-hyponym relation model file by using the CRF tool according to the training file and a predetermined template file.
Optionally, as shown in Fig. 4, the file generating unit 2032 comprises:
a first tagging subunit 20321, configured to perform word segmentation and part-of-speech tagging on the training sentences to obtain first tagged sentences;
a second tagging subunit 20322, configured to generate a to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotate the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generate the training file.
Optionally, the second tagging subunit 20322 is specifically configured to:
generate the to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, wherein the to-be-annotated training file comprises a plurality of tokens located on different rows, each token comprises a plurality of feature columns and a result output label column, the feature columns respectively represent different features including a word feature, a part-of-speech feature, a feature-word dictionary feature and a pointing-information feature, the word feature column corresponds to a word or punctuation mark included in the first tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
extract feature words from the first tagged sentences and store them in a feature-word dictionary;
judge, according to the feature-word dictionary, whether the content of the word feature column of each token in the to-be-annotated training file is a feature word, and if so, mark 1 in the feature-word dictionary feature column, otherwise mark 0;
judge whether the content of the word feature column of each token in the to-be-annotated training file is a colon, and if so, mark 1 in the pointing-information feature column, otherwise mark 0;
judge whether the contents of the word feature columns of a plurality of tokens in the to-be-annotated training file are words having a hypernym-hyponym relation, and if so, mark U in the result output label column of the word that is the superordinate concept and mark D in the result output label column of the word that is the subordinate concept; otherwise, mark 0 in the result output label column of the token.
Optionally, the feature words are obtained by filtering out the start/stop marker words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first tagged sentences and taking the remaining words in the first tagged sentences as candidate feature words; the word frequency of each candidate feature word is then counted, and the feature words are selected, according to the semantic function of the candidate feature words, from the candidate feature words whose word frequency exceeds a predetermined threshold, wherein the word frequency is the number of times a candidate feature word occurs in the first tagged sentences.
Optionally, the relation acquisition unit is specifically configured to:
input, according to the plurality of feature templates included in the template file, the word feature, part-of-speech feature, feature-word dictionary feature, pointing-information feature and result output label column of each token in the training file into the CRF tool, and output the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance is the number of unit rows slid forward or backward in the training file taking the current feature as the reference, and the column position information covers the word feature, the part-of-speech feature, the feature-word dictionary feature, the pointing-information feature and the result output label column.
Optionally, as shown in Fig. 5, the acquisition module 201 comprises:
a first collecting unit 2011, configured to crawl, by a web crawler, the dictionary pages related to the first predetermined field terms, and obtain the entry explanation sentences in which the first predetermined field terms are located on the dictionary pages;
a first obtaining unit 2012, configured to obtain, from the entry explanation sentences in which the first predetermined field terms are located, the field terms carrying hyperlinks;
a second collecting unit 2013, configured to crawl, by the web crawler, the dictionary pages related to the field terms carrying hyperlinks, and obtain the entry explanation sentences in which the words semantically related to the field terms carrying hyperlinks are located on those dictionary pages, until the predetermined crawl depth of the web crawler is reached.
Optionally, as shown in Fig. 6, the relation acquisition module 205 comprises:
a word segmentation unit 2051, configured to perform word segmentation and part-of-speech tagging on the entry explanation sentences to obtain second tagged sentences;
a format conversion unit 2052, configured to generate a test file from the second tagged sentences according to a predetermined CRF test file format, wherein the test file comprises a plurality of tokens, each token comprises two columns which respectively represent a word feature and a part-of-speech feature, the word feature column corresponds to a word or punctuation mark included in the second tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
a test unit 2053, configured to test the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
The device for mining hypernym-hyponym relations between field terms of the embodiment of the present invention may be applied in the query expansion stage of search engines and automatic question answering. For example, as shown in Fig. 7, when applied in a search engine, it is only necessary to predetermine the field terms of a number of related fields as a seed term set and input them into the device; the device then outputs a set of terms with hypernym-hyponym relations, which is saved in the search engine, so that after a user inputs a search keyword, the search engine can use the hypernym-hyponym relations between field terms to recommend more and broader related information to the user, enriching the user experience and enhancing user stickiness.
Described above are preferred embodiments of the present invention. It should be pointed out that a person of ordinary skill in the art may further make improvements and modifications without departing from the principle of the present invention, and these improvements and modifications also fall within the protection scope of the present invention.
Claims (18)
1. A method for mining hypernym-hyponym relations between field terms, characterized by comprising:
gathering, according to a plurality of first predetermined field terms, entry explanation sentences in which first field terms are located on dictionary pages, the first field terms being words semantically related to the first predetermined field terms;
obtaining, by using a hypernym-hyponym relation model file generated in advance by a conditional random field (CRF) tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
2. The method as claimed in claim 1, characterized in that before the step of gathering, according to the plurality of first predetermined field terms, the entry explanation sentences in which the first field terms are located on the dictionary pages, the method further comprises:
generating the hypernym-hyponym relation model file by using the CRF tool.
3. The method as claimed in claim 2, characterized in that the step of generating the hypernym-hyponym relation model file by using the CRF tool comprises:
gathering, according to a plurality of second predetermined field terms, entry explanation sentences in which second field terms are located on the dictionary pages, and saving them as training sentences, the second field terms being words semantically related to the second predetermined field terms;
generating a training file from the training sentences according to a predetermined CRF training file format;
generating the hypernym-hyponym relation model file by using the CRF tool according to the training file and a predetermined template file.
4. The method as claimed in claim 3, characterized in that the step of generating the training file from the training sentences according to the predetermined CRF training file format comprises:
performing word segmentation and part-of-speech tagging on the training sentences to obtain first tagged sentences;
generating a to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotating the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generating the training file.
5. The method as claimed in claim 4, characterized in that the step of generating the to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotating the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generating the training file comprises:
generating the to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, wherein the to-be-annotated training file comprises a plurality of tokens located on different rows, each token comprises a plurality of feature columns and a result output label column, the feature columns respectively represent different features including a word feature, a part-of-speech feature, a feature-word dictionary feature and a pointing-information feature, the word feature column corresponds to a word or punctuation mark included in the first tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
extracting feature words from the first tagged sentences and storing them in a feature-word dictionary;
judging, according to the feature-word dictionary, whether the content of the word feature column of each token in the to-be-annotated training file is a feature word, and if so, marking 1 in the feature-word dictionary feature column, otherwise marking 0;
judging whether the content of the word feature column of each token in the to-be-annotated training file is a colon, and if so, marking 1 in the pointing-information feature column, otherwise marking 0;
judging whether the contents of the word feature columns of a plurality of tokens in the to-be-annotated training file are words having a hypernym-hyponym relation, and if so, marking U in the result output label column of the word that is the superordinate concept and marking D in the result output label column of the word that is the subordinate concept; otherwise, marking 0 in the result output label column of the token.
6. The method as claimed in claim 5, characterized in that the step of extracting the feature words from the first tagged sentences and storing them in the feature-word dictionary comprises:
filtering out the start/stop marker words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first tagged sentences, and taking the remaining words in the first tagged sentences as candidate feature words;
counting the word frequency of each candidate feature word, the word frequency being the number of times the candidate feature word occurs in the first tagged sentences;
selecting, according to the semantic function of the candidate feature words, the feature words from the candidate feature words whose word frequency exceeds a predetermined threshold, and storing them in the feature-word dictionary.
7. The method as claimed in claim 5, characterized in that the step of generating the hypernym-hyponym relation model file by using the CRF tool according to the training file and the template file comprises:
inputting, according to the plurality of feature templates included in the template file, the word feature, part-of-speech feature, feature-word dictionary feature, pointing-information feature and result output label column of each token in the training file into the CRF tool, and outputting the hypernym-hyponym relation model file from the CRF tool, wherein each feature template carries the sliding distance and the column position information of the next feature to be input into the CRF tool, the sliding distance is the number of unit rows slid forward or backward in the training file taking the current feature as the reference, and the column position information covers the word feature, the part-of-speech feature, the feature-word dictionary feature, the pointing-information feature and the result output label column.
8. The method as claimed in claim 1, characterized in that the step of gathering, according to the plurality of first predetermined field terms, the entry explanation sentences in which the first field terms are located on the dictionary pages comprises:
crawling, by a web crawler, the dictionary pages related to the first predetermined field terms, and obtaining the entry explanation sentences in which the first predetermined field terms are located on the dictionary pages;
obtaining, from the entry explanation sentences in which the first predetermined field terms are located, the field terms carrying hyperlinks;
crawling, by the web crawler, the dictionary pages related to the field terms carrying hyperlinks, and obtaining the entry explanation sentences in which the words semantically related to the field terms carrying hyperlinks are located on those dictionary pages, until the predetermined crawl depth of the web crawler is reached.
9. The method as claimed in claim 1, characterized in that the step of obtaining, by using the hypernym-hyponym relation model file generated in advance by the conditional random field (CRF) tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences comprises:
performing word segmentation and part-of-speech tagging on the entry explanation sentences to obtain second tagged sentences;
generating a test file from the second tagged sentences according to a predetermined CRF test file format, wherein the test file comprises a plurality of tokens, each token comprises two columns which respectively represent a word feature and a part-of-speech feature, the word feature column corresponds to a word or punctuation mark included in the second tagged sentences, and the part-of-speech feature column corresponds to the part of speech of the content of the word feature column;
testing the test file by using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relations between the words included in the test file.
10. A device for mining hypernym-hyponym relations between field terms, characterized by comprising:
an acquisition module, configured to gather, according to a plurality of first predetermined field terms, entry explanation sentences in which first field terms are located on dictionary pages, the first field terms being words semantically related to the first predetermined field terms;
a relation acquisition module, configured to obtain, by using a hypernym-hyponym relation model file generated in advance by a conditional random field (CRF) tool, the hypernym-hyponym relations between the first field terms and the words included in the entry explanation sentences.
11. The device as claimed in claim 10, characterized in that the device further comprises:
a model building module, configured to generate the hypernym-hyponym relation model file by using the CRF tool.
12. The device as claimed in claim 11, characterized in that the model building module comprises:
a collecting unit, configured to gather, according to a plurality of second predetermined field terms, entry explanation sentences in which second field terms are located on the dictionary pages, and save them as training sentences, the second field terms being words semantically related to the second predetermined field terms;
a file generating unit, configured to generate a training file from the training sentences according to a predetermined CRF training file format;
a relation acquisition unit, configured to generate the hypernym-hyponym relation model file by using the CRF tool according to the training file and a predetermined template file.
13. The device as claimed in claim 12, characterized in that the file generating unit comprises:
a first tagging subunit, configured to perform word segmentation and part-of-speech tagging on the training sentences to obtain first tagged sentences;
a second tagging subunit, configured to generate a to-be-annotated training file from the first tagged sentences according to the predetermined CRF training file format, annotate the features in the to-be-annotated file other than the word feature and the part-of-speech feature, and generate the training file.
14. The device according to claim 13, characterized in that the second labeling subunit is specifically configured to:
generate the training file to be labeled from the first labeled sentences according to the preset CRF training file format, wherein the training file to be labeled comprises a plurality of tokens located on different lines, each token comprising a plurality of feature columns and a result-output label column, the feature columns respectively representing different features, the features comprising a word feature, a part-of-speech feature, a feature-word dictionary feature and a punctuation-information feature, the word feature column corresponding to one word or one punctuation mark included in the first labeled sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
extract feature words from the first labeled sentences and store them in a feature-word dictionary;
judge, according to the feature-word dictionary, whether the content of the word feature column of each token in the training file to be labeled is a feature word; if so, mark 1 in the feature-word dictionary feature column, otherwise mark 0;
judge whether the content of the word feature column of each token in the training file to be labeled is a colon; if so, mark 1 in the punctuation-information feature column, otherwise mark 0;
judge whether the contents of the word feature columns of a plurality of tokens in the training file to be labeled are words having a hypernym-hyponym relation; if so, mark U in the result-output label column corresponding to the word belonging to the superordinate concept and mark D in the result-output label column corresponding to the word belonging to the subordinate concept; otherwise, mark 0 in the result-output label column of the token.
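The per-token labeling of claim 14 (feature-word dictionary flag, colon flag, and the U/D/0 output label) can be sketched as below. The function and argument names are hypothetical; the known hypernym/hyponym sets stand in for whatever supervision the training data provides.

```python
def label_tokens(tokens, feature_dict, hypernyms, hyponyms):
    """Build the feature columns described in claim 14 for one segmented,
    POS-tagged sentence. `tokens` is a list of (word, pos) pairs;
    `feature_dict` is the feature-word dictionary; `hypernyms`/`hyponyms`
    hold the known related words used to fill the result-output label column."""
    rows = []
    for word, pos in tokens:
        dict_flag = "1" if word in feature_dict else "0"      # feature-word dictionary feature
        colon_flag = "1" if word in ("：", ":") else "0"       # punctuation-information feature
        if word in hypernyms:
            label = "U"  # superordinate concept
        elif word in hyponyms:
            label = "D"  # subordinate concept
        else:
            label = "0"
        rows.append((word, pos, dict_flag, colon_flag, label))
    return rows

# "猫 是 一种 动物 。" (a cat is a kind of animal):
rows = label_tokens(
    [("猫", "n"), ("是", "v"), ("一种", "m"), ("动物", "n"), ("。", "w")],
    feature_dict={"是", "一种"},
    hypernyms={"动物"},
    hyponyms={"猫"},
)
```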
15. The device according to claim 14, characterized in that the feature words are obtained by filtering out the start and stop tag words, nouns, adjectives, English words, time expressions, Arabic numerals and punctuation marks in the first labeled sentences, taking the remaining words of the first labeled sentences as candidate feature words, then counting the word frequency of each candidate feature word and, according to the semantic function of the candidate feature words, selecting from the candidate feature words whose word frequency exceeds a predetermined threshold, wherein the word frequency is the number of times a candidate feature word occurs in the first labeled sentences.
16. The device according to claim 14, characterized in that the relation acquisition unit is specifically configured to:
input, according to a plurality of feature templates included in the template file, the word feature, the part-of-speech feature, the feature-word dictionary feature, the punctuation-information feature and the result-output label column of each token in the training file into the CRF tool, and output the hypernym-hyponym relation model file from the CRF tool, wherein the feature templates carry the sliding distance and the column position information of the features input into the CRF tool, the sliding distance being the number of lines slid forward or backward in the training file with the current feature as the reference, and the column position information indicating the word feature, the part-of-speech feature, the feature-word dictionary feature, the punctuation-information feature or the result-output label column.
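The "sliding distance and column position information" of claim 16 correspond directly to the row and column offsets of a CRF++-style template file. The fragment below is an illustrative template, not one disclosed in the patent: in `%x[r,c]`, `r` is the sliding distance (lines forward or backward from the current token) and `c` the column position (here 0 = word, 1 = part of speech, 2 = feature-word dictionary flag, 3 = punctuation flag).

```text
# Unigram templates over the word column: previous, current and next token
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
# Part-of-speech window
U03:%x[-1,1]
U04:%x[0,1]
U05:%x[1,1]
# Feature-word dictionary flag and colon flag of the current token
U06:%x[0,2]
U07:%x[0,3]
# Bigram template: condition on the previous output label as well
B
```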
17. The device according to claim 10, characterized in that the acquisition module comprises:
a first collecting unit, configured to crawl, by a web crawler, the word bank page related to the first predetermined domain-specific term, and to obtain the entry explanation sentence on the word bank page in which the first predetermined domain-specific term is located;
a first acquisition unit, configured to obtain, from the entry explanation sentence in which the first predetermined domain-specific term is located, the domain-specific terms carrying hyperlinks;
a second collecting unit, configured to crawl, by the web crawler, the word bank pages related to the domain-specific terms carrying hyperlinks, and to obtain from those pages the entry explanation sentences in which the words semantically related to the hyperlinked domain-specific terms are located, until a preset crawl depth of the web crawler is reached.
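The depth-limited crawl of claim 17 amounts to a breadth-first traversal that stops expanding links once the preset depth is reached. The sketch below replaces real HTTP fetching with an injected `fetch_entry` function so the traversal logic stands alone; all names and the fake lexicon are illustrative.

```python
from collections import deque

def crawl_entries(seed_term, fetch_entry, max_depth):
    """Breadth-first crawl of lexicon pages. `fetch_entry` stands in for the
    web crawler: given a term, it returns (entry explanation sentence,
    hyperlinked terms found in that sentence). Links found at `max_depth`
    are not followed further."""
    sentences = {}
    seen = {seed_term}
    queue = deque([(seed_term, 0)])
    while queue:
        term, depth = queue.popleft()
        sentence, linked_terms = fetch_entry(term)
        sentences[term] = sentence
        if depth < max_depth:
            for linked in linked_terms:
                if linked not in seen:
                    seen.add(linked)
                    queue.append((linked, depth + 1))
    return sentences

# A fake lexicon standing in for crawled word bank pages.
FAKE_PAGES = {
    "machine learning": ("machine learning is a branch of artificial intelligence",
                         ["artificial intelligence"]),
    "artificial intelligence": ("artificial intelligence is a field of computer science",
                                ["computer science"]),
    "computer science": ("computer science is the study of computation", []),
}
result = crawl_entries("machine learning", FAKE_PAGES.get, max_depth=1)
```

With `max_depth=1` the crawl visits the seed page and the pages it links to, but does not follow links found on those second-level pages.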
18. The device according to claim 10, characterized in that the relation acquisition module comprises:
a word segmentation unit, configured to perform word segmentation and part-of-speech tagging on the entry explanation sentence to obtain a second labeled sentence;
a format conversion unit, configured to generate a test file from the second labeled sentence according to a preset CRF test file format, the test file comprising a plurality of tokens, each token comprising two columns which respectively represent the word feature and the part-of-speech feature, the word feature column corresponding to one word or one punctuation mark included in the second labeled sentence, and the part-of-speech feature column corresponding to the part of speech of the content of the word feature column;
a test unit, configured to test the test file using the hypernym-hyponym relation model file to obtain the hypernym-hyponym relation between the words included in the test file.
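Once the CRF tool has tagged a test sentence with the U/D/0 labels of claim 14, the final step of claim 18 is reading relation pairs back out of the tags. A minimal decoding sketch, with hypothetical names; pairing every D-word with every U-word in the sentence is one simple policy for the single-label scheme, not necessarily the patent's:

```python
def extract_pairs(tagged_tokens):
    """Turn the CRF tool's tagged output for one entry explanation sentence
    into (hypernym, hyponym) pairs: words tagged U are superordinate
    concepts, words tagged D subordinate ones."""
    hypernyms = [word for word, tag in tagged_tokens if tag == "U"]
    hyponyms = [word for word, tag in tagged_tokens if tag == "D"]
    return [(hyper, hypo) for hyper in hypernyms for hypo in hyponyms]

# Tagged output for "猫是一种动物。" (a cat is a kind of animal):
pairs = extract_pairs(
    [("猫", "D"), ("是", "0"), ("一种", "0"), ("动物", "U"), ("。", "0")]
)
```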
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510652163.2A CN106569993A (en) | 2015-10-10 | 2015-10-10 | Method and device for mining hypernym-hyponym relation between domain-specific terms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106569993A true CN106569993A (en) | 2017-04-19 |
Family
ID=58506722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510652163.2A Pending CN106569993A (en) | 2015-10-10 | 2015-10-10 | Method and device for mining hypernym-hyponym relation between domain-specific terms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106569993A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | 昆明理工大学 | Text classification feature selection and weight computation method based on domain knowledge |
CN101710343A (en) * | 2009-12-11 | 2010-05-19 | 北京中机科海科技发展有限公司 | Automatic ontology construction system and method based on text mining |
CN102117281A (en) * | 2009-12-30 | 2011-07-06 | 北京亿维讯科技有限公司 | Method for constructing domain ontology |
CN102360383A (en) * | 2011-10-15 | 2012-02-22 | 西安交通大学 | Method for extracting domain terms and term relations from text |
CN103699568A (en) * | 2013-11-16 | 2014-04-02 | 西安交通大学城市学院 | Method for extracting hypernym-hyponym relations of domain terms from Wikipedia |
Non-Patent Citations (1)
Title |
---|
HUANG, Yi et al.: "A method for acquiring hypernym-hyponym relations between domain terms based on conditional random fields", Journal of Central South University (Science and Technology) * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108733702A (en) * | 2017-04-20 | 2018-11-02 | 北京京东尚科信息技术有限公司 | Method, apparatus, electronic device and medium for extracting hypernym-hyponym relations from user queries |
CN108733702B (en) * | 2017-04-20 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Method, device, electronic equipment and medium for extracting upper and lower relation of user query |
CN107357776A (en) * | 2017-06-16 | 2017-11-17 | 北京奇艺世纪科技有限公司 | Related-term mining method and device |
CN108563617A (en) * | 2018-03-12 | 2018-09-21 | 北京云知声信息技术有限公司 | Mining method and device for Chinese sentence hybrid templates |
CN110674306A (en) * | 2018-06-15 | 2020-01-10 | 株式会社日立制作所 | Knowledge graph construction method and device and electronic equipment |
CN110674306B (en) * | 2018-06-15 | 2023-06-20 | 株式会社日立制作所 | Knowledge graph construction method and device and electronic equipment |
US11556570B2 (en) | 2018-09-20 | 2023-01-17 | International Business Machines Corporation | Extraction of semantic relation |
CN109800308B (en) * | 2019-01-22 | 2022-04-15 | 四川长虹电器股份有限公司 | Short text classification method based on part-of-speech and fuzzy pattern recognition combination |
CN109800308A (en) * | 2019-01-22 | 2019-05-24 | 四川长虹电器股份有限公司 | Short text classification method combining part of speech and fuzzy pattern recognition |
CN109933691A (en) * | 2019-02-11 | 2019-06-25 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and storage medium for content retrieval |
CN110209822A (en) * | 2019-06-11 | 2019-09-06 | 中译语通科技股份有限公司 | Deep-learning-based method and computer for predicting relations in academic-domain data |
CN110196982B (en) * | 2019-06-12 | 2022-12-27 | 腾讯科技(深圳)有限公司 | Method and device for extracting upper-lower relation and computer equipment |
CN110196982A (en) * | 2019-06-12 | 2019-09-03 | 腾讯科技(深圳)有限公司 | Hypernym-hyponym relation extraction method, device and computer equipment |
CN110263342A (en) * | 2019-06-20 | 2019-09-20 | 北京百度网讯科技有限公司 | Method and device for mining hypernym-hyponym relations of entities, and electronic device |
CN110334186A (en) * | 2019-07-08 | 2019-10-15 | 北京三快在线科技有限公司 | Data query method, apparatus, computer equipment and computer readable storage medium |
CN110334186B (en) * | 2019-07-08 | 2021-09-28 | 北京三快在线科技有限公司 | Data query method and device, computer equipment and computer readable storage medium |
GB2602762A (en) * | 2019-09-18 | 2022-07-13 | Ibm | Hypernym detection using strict partial order networks |
US11068665B2 (en) | 2019-09-18 | 2021-07-20 | International Business Machines Corporation | Hypernym detection using strict partial order networks |
WO2021053511A1 (en) * | 2019-09-18 | 2021-03-25 | International Business Machines Corporation | Hypernym detection using strict partial order networks |
US11694035B2 (en) | 2019-09-18 | 2023-07-04 | International Business Machines Corporation | Hypernym detection using strict partial order networks |
CN114692620A (en) * | 2020-12-28 | 2022-07-01 | 阿里巴巴集团控股有限公司 | Text processing method and device |
CN114020880A (en) * | 2022-01-06 | 2022-02-08 | 杭州费尔斯通科技有限公司 | Method, system, electronic device and storage medium for extracting hypernym |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106569993A (en) | Method and device for mining hypernym-hyponym relation between domain-specific terms | |
CN107436864B (en) | Chinese question-answer semantic similarity calculation method based on Word2Vec | |
Li et al. | Recursive deep models for discourse parsing | |
Ptaszynski et al. | Language combinatorics: A sentence pattern extraction architecture based on combinatorial explosion | |
CN102799577B (en) | Method for extracting semantic relations between Chinese entities | |
CN105138514B (en) | Chinese word segmentation method based on a dictionary with forward incremental word-by-word maximum matching | |
CN107133212B (en) | Text implication recognition method based on integrated learning and word and sentence comprehensive information | |
CN106537370A (en) | Method and system for robust tagging of named entities in the presence of source or translation errors | |
CN112948543A (en) | Multi-language multi-document abstract extraction method based on weighted TextRank | |
CN103314369B (en) | Machine translation apparatus and method | |
Sahu et al. | Prashnottar: a Hindi question answering system | |
CN106528524A (en) | Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm | |
CN105808711A (en) | System and method for generating model based on semantic text concept | |
Dien et al. | POS-tagger for English-Vietnamese bilingual corpus | |
Othman et al. | Arabic text processing model: Verbs roots and conjugation automation | |
Sembok et al. | Arabic word stemming algorithms and retrieval effectiveness | |
Sethi et al. | Automated title generation in English language using NLP | |
Kessler et al. | Extraction of terminology in the field of construction | |
CN106126501B (en) | Noun word sense disambiguation method and device based on dependency constraints and knowledge | |
CN109002540B (en) | Method for automatically generating Chinese announcement document question answer pairs | |
CN106202033B (en) | Adverb word sense disambiguation method and device based on dependency constraints and knowledge | |
Ali et al. | A hybrid approach to Urdu verb phrase chunking | |
Outahajala et al. | Using confidence and informativeness criteria to improve POS-tagging in amazigh | |
Al-Arfaj et al. | Arabic NLP tools for ontology construction from Arabic text: An overview | |
Chawla et al. | Pre-trained affective word representations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170419 |