CN103425660B - Method and device for acquiring entries - Google Patents
- Publication number: CN103425660B
- Application number: CN201210151282.6A
- Authority: CN (China)
- Prior art keywords: anchor text, extracted, entry, existing entry, distance
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention provides a method and a device for acquiring entries. The method comprises: obtaining a set of existing entries of the same category in an entry base; searching with the obtained set of existing entries to obtain anchor texts containing the existing entries, and recording the web page position of the anchor text of each existing entry; and, according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance to the anchor text of the existing entry meets a preset requirement. The method and device mine entity entries from an existing lexicon and can guide users in creating new entries, which addresses the insufficient coverage of entity entries in an encyclopedia database and facilitates more effective knowledge search.
Description
【Technical field】
The present invention relates to the field of Internet information processing technology, and in particular to a method and a device for acquiring entries.
【Background art】
With the continuing development of communication technology and networks, people increasingly use the Internet to search for knowledge and information. An encyclopedia website is a platform on which all Internet users can browse, create and improve content on an equal footing, for example Baidu Baike, Wikipedia and Hudong Baike. Through an encyclopedia website, an Internet user can find the comprehensive, accurate and objective definitional information they are looking for, and other users can query and browse similar topics, so that the corresponding knowledge or reference is provided.
An entry is the basic unit of content on an encyclopedia website. An entry has one or more separate topics and presents knowledge about a thing, a person, or a combination with a particular theme, for example "the Forbidden City", "Liu Dehua (Andy Lau)" and "2008 Beijing Olympic Games". An encyclopedia website contains a vast number of entries, which record content from all kinds of industries, topics and fields of knowledge. For a search engine, using these encyclopedia entries can greatly improve retrieval accuracy and retrieval coverage, and helps in extracting structured data from web pages for vertical search, so that more accurate information is obtained.
With the mass propagation of information and the continual expansion of what people communicate about, new entries emerge constantly. At present, a new entry is added manually: someone creates the knowledge content for the new entry, and entries that pass manual review are then added to the encyclopedia website so that users can search for the knowledge and information. An entry that has not yet been created, such as a new song, film or person, is not actively discovered on the Internet by the system. As a result, some new entries cannot be created and updated in time, which affects the retrieval speed of the search engine and can even affect retrieval accuracy and recall.
【Summary of the invention】
In view of this, the invention provides a method and a device for acquiring entries, which mine entity entries from an existing lexicon and can guide users in creating new entries, addressing the insufficient coverage of entity entries in an encyclopedia database and facilitating more effective knowledge search.
The specific technical solution is as follows:
A method for acquiring entries, comprising the following steps:
S1: obtaining a set of existing entries of the same category in an entry base;
S2: searching with the obtained set of existing entries to obtain anchor texts containing the existing entries, and recording the web page position of the anchor text of each existing entry;
S3: according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance to the anchor text of the existing entry meets a preset requirement.
According to a preferred embodiment of the invention, the method further comprises, after step S3:
S4: calculating the weight of each extracted anchor text according to its context distance to the anchor text of the existing entry, counting the frequency with which the extracted anchor text occurs in the current category, and identifying an anchor text whose frequency or weight meets a preset requirement as a new entry.
According to a preferred embodiment of the invention, the web page position of an anchor text comprises:
the web page containing the anchor text, the page block containing the anchor text, and the position of the anchor text within the page block.
According to a preferred embodiment of the invention, the context distance meeting the preset requirement comprises:
the page block containing the extracted anchor text being the same as the page block containing the anchor text of the existing entry.
According to a preferred embodiment of the invention, the context distance meeting the preset requirement further comprises:
the spacing between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
According to a preferred embodiment of the invention, calculating the weight of the extracted anchor text according to the context distance to the anchor text of the existing entry specifically comprises:
within the same page block, determining the context distance between the extracted anchor text and the anchor text of the existing entry;
using the determined context distance, calculating the weight of the extracted anchor text in the corresponding page block;
over the whole current category, summing the weights calculated for the extracted anchor text in each page block from which it was extracted, to obtain the weight of the extracted anchor text.
According to a preferred embodiment of the invention, determining, within the same page block, the context distance between the extracted anchor text and the anchor text of the existing entry specifically comprises:
determining the anchor texts of existing entries contained in the page block in which the extracted anchor text is located;
calculating the distance between the extracted anchor text and each of the anchor texts of existing entries so obtained;
taking the minimum of these distances as the context distance to the existing entry.
According to a preferred embodiment of the invention, the method further comprises, after step S3:
comparing the extracted anchor texts with the entry base to obtain the unrecorded anchor texts, that is, those not yet included in the entry base;
performing step S4 only on the unrecorded anchor texts.
According to a preferred embodiment of the invention, the method further comprises, after step S3:
filtering out, from the extracted anchor texts, those that do not contain a specified part of speech;
performing step S4 only on the anchor texts remaining after filtering.
A device for acquiring entries, comprising:
an existing entry acquisition module, configured to obtain a set of existing entries of the same category in an entry base;
a search module, configured to search with the set of existing entries obtained by the existing entry acquisition module, obtain anchor texts containing the existing entries, and record the web page position of the anchor text of each existing entry;
an extraction module, configured to extract, according to the web page positions recorded by the search module, at the corresponding positions the anchor texts whose context distance to the anchor text of the existing entry meets a preset requirement.
According to a preferred embodiment of the invention, the device further comprises:
a new entry identification module, configured to calculate the weight of each anchor text extracted by the extraction module according to its context distance to the anchor text of the existing entry, count the frequency with which the extracted anchor text occurs in the current category, and identify an anchor text whose frequency or weight meets a preset requirement as a new entry.
According to a preferred embodiment of the invention, the web page position of an anchor text comprises:
the web page containing the anchor text, the page block containing the anchor text, and the position of the anchor text within the page block.
According to a preferred embodiment of the invention, the context distance meeting the preset requirement comprises:
the page block containing the extracted anchor text being the same as the page block containing the anchor text of the existing entry.
According to a preferred embodiment of the invention, the context distance meeting the preset requirement further comprises:
the spacing between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
According to a preferred embodiment of the invention, the new entry identification module comprises:
a distance determining unit, configured to determine, within the same page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
a weight calculation unit, configured to calculate, using the context distance determined by the distance determining unit, the weight of the extracted anchor text in the corresponding page block;
a summing unit, configured to sum, over the whole current category, the weights calculated for the extracted anchor text in each page block from which it was extracted, to obtain the weight of the extracted anchor text.
According to a preferred embodiment of the invention, the distance determining unit is specifically configured to:
determine the anchor texts of existing entries contained in the page block in which the extracted anchor text is located;
calculate the distance between the extracted anchor text and each of the anchor texts of existing entries so obtained;
take the minimum of these distances as the context distance to the existing entry.
According to a preferred embodiment of the invention, the device further comprises:
an existing entry filtering module, configured to compare the anchor texts extracted by the extraction module with the entry base to obtain the unrecorded anchor texts,
and to supply the unrecorded anchor texts to the new entry identification module.
According to a preferred embodiment of the invention, the device further comprises:
a part-of-speech filtering module, configured to filter out, from the anchor texts extracted by the extraction module, those that do not contain a specified part of speech,
and to supply the anchor texts remaining after filtering to the new entry identification module.
As can be seen from the above technical solutions, the method and device for acquiring entries provided by the invention mine entity entries from an existing lexicon and supply new entries that have not yet been created. They can guide users in creating the knowledge content for new entries, address the insufficient coverage of entity entries in an encyclopedia database, help to improve the available data resources, and facilitate more effective knowledge search.
【Brief description of the drawings】
Fig. 1 is a flow chart of the entry acquisition method provided by Embodiment One of the invention;
Fig. 2 is a schematic diagram of a web page and the page blocks it contains;
Fig. 3 is a schematic diagram of a page block found by searching with the existing entry "Because of Love";
Fig. 4 is a flow chart of the entry acquisition method provided by Embodiment Two of the invention;
Fig. 5 is a schematic diagram of the entry acquisition device provided by Embodiment Three of the invention;
Fig. 6 is a schematic diagram of the entry acquisition device provided by Embodiment Four of the invention.
【Detailed description】
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment One
Fig. 1 is a flow chart of the entry acquisition method provided by this embodiment. As shown in Fig. 1, the method comprises:
Step S101: obtain a set of existing entries of the same category in an entry base.
The entry base may be any classified entry base, such as an encyclopedia entry base or an input method entry base. The invention is described using an encyclopedia entry base as the example.
The categories may be the original categories of the classified entry base, including song, film, person, nature, culture, geography, history, life, society, art, economy, science and technology, sports and so on. Alternatively, the existing entries may be divided into categories using an existing classification or clustering method (such as Bayesian classification, decision trees or support vector machines (SVM)).
The set of existing entries of one category is obtained from the entry base, and steps S102 and S103 are performed on the existing entries of each category in the entry base, one category at a time.
Step S102: search with the obtained set of existing entries to obtain anchor texts containing the existing entries, and record the web page position of the anchor text of each existing entry.
Internet web pages are searched using the obtained set of existing entries to find anchor texts containing existing entries, and those anchor texts are recorded together with their web page positions.
The web page position of an anchor text may comprise: the web page containing the anchor text, the page block containing the anchor text, and the position of the anchor text within the page block. Fig. 2 is a schematic diagram of a web page and the page blocks it contains. As shown in Fig. 2, the web page position of anchor text 1 is the first position in page block A of that web page.
For example, step S101 obtains the existing song category set T1 from the encyclopedia entries. The set T1 contains tens of thousands of existing entries, for example {"Because of Love", "Loving You Hurts Until I Feel No Pain", and so on}. Anchor texts containing existing entries of the song category set T1 are found by searching. For example, a search with the existing entry "Because of Love" finds the anchor text "Because of Love" on the web page http://ting.baidu.com, as shown in Fig. 3, and the page block and web page position of the anchor text "Because of Love" are recorded.
Alternatively, when searching for anchor texts containing the existing entries, all anchor texts of each web page on the Internet may be obtained first and then matched against the set of existing entries of each category, so as to find the anchor texts that match and record the web page, page block and web page position of those anchor texts.
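The matching variant described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record type `AnchorRecord`, its field names, and the use of substring containment as the matching rule are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorRecord:
    """Hypothetical record of one anchor text and its web page position."""
    page: str    # URL of the web page containing the anchor text
    block: str   # identifier of the page block containing the anchor text
    index: int   # ordinal position of the anchor text within the block
    text: str    # the anchor text itself


def match_existing(anchors, existing_entries):
    """Return the records whose anchor text contains an existing entry.

    Substring containment is an assumed matching rule.
    """
    return [a for a in anchors if any(e in a.text for e in existing_entries)]


anchors = [
    AnchorRecord("http://ting.baidu.com", "A", 0, "Because of Love"),
    AnchorRecord("http://ting.baidu.com", "A", 1, "Wang Fei"),
    AnchorRecord("http://ting.baidu.com", "B", 0, "Help"),
]
songs = {"Because of Love"}
hits = match_existing(anchors, songs)
```

Each hit keeps the page, block and in-block position, which is the information the later extraction step relies on.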
Step S103: according to the recorded web page positions, extract at the corresponding positions the anchor texts whose context distance to the anchor text of the existing entry meets a preset requirement.
For each recorded web page position of an existing entry's anchor text, the anchor texts whose context distance to that position meets the requirement are extracted as entries.
The context distance meeting the preset requirement may comprise:
the page block containing the extracted anchor text being the same as the page block containing the anchor text of the existing entry. In Fig. 2, for example, anchor text 1 and anchor text 3 are in the same page block, whereas anchor text 1 and anchor text 5 are in different page blocks. If anchor text 1 is the anchor text of an existing entry, the anchor texts that meet the requirement and can be extracted are anchor text 2 and anchor text 3.
Specifically, the page block containing an anchor text may be determined from page layout tags, for example by using layout tags such as "<div></div>" and "<table></table>" to decide whether two anchor texts are in the same page block. Alternatively, the same page block may be determined by visual segmentation of the web page, or the like.
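One way to group anchor texts by layout tag is sketched below with Python's standard `html.parser` module. Treating the innermost enclosing `<div>` or `<table>` as the page block is an assumption made for this example; the class name `AnchorBlockParser` and the numbering of blocks are likewise illustrative.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # layout tags treated as page blocks (assumption)


class AnchorBlockParser(HTMLParser):
    """Collects (block_id, anchor_text) pairs from an HTML string."""

    def __init__(self):
        super().__init__()
        self.block_stack = []   # ids of the block elements currently open
        self.next_block_id = 0
        self.in_anchor = False
        self.current_text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.block_stack.append(self.next_block_id)
            self.next_block_id += 1
        elif tag == "a":
            self.in_anchor = True
            self.current_text = []

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.block_stack:
            self.block_stack.pop()
        elif tag == "a" and self.in_anchor:
            # The innermost open block is taken as the anchor's page block.
            block = self.block_stack[-1] if self.block_stack else None
            self.anchors.append((block, "".join(self.current_text).strip()))
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.current_text.append(data)


def anchors_by_block(html):
    parser = AnchorBlockParser()
    parser.feed(html)
    parser.close()
    return parser.anchors


html = ('<div><a href="#">Anchor 1</a><a href="#">Anchor 2</a></div>'
        '<div><a href="#">Anchor 5</a></div>')
pairs = anchors_by_block(html)
```

Two anchor texts are then in the same page block when their block ids are equal. A visual segmentation approach, which the text also mentions, would need rendering information and is not covered by this sketch.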
Alternatively, the requirement may be that the page block containing the extracted anchor text is the same as the page block containing the anchor text of the existing entry, and that the spacing between the extracted anchor text and the anchor text of the existing entry is less than a preset distance threshold.
For example, Fig. 3 is a schematic diagram of a page block found by searching with the existing entry "Because of Love". In Fig. 3, anchor texts such as "Wang Fei", "Can't Afford to Be Hurt", "Wang Lin", "The Most Dazzling Folk Style", "Phoenix Legend", "New Drunken Concubine" and "Keeping of Love" are in the same page block as the anchor text "Because of Love" of the existing entry, so those anchor texts are extracted as entries.
To further improve precision, the spacing may also be limited when extracting the anchor texts whose context distance meets the preset requirement. If the spacing between anchor texts such as "New Drunken Concubine" and "Keeping of Love" in Fig. 3 and the anchor text "Because of Love" of the existing entry exceeds the preset distance threshold, those anchor texts are not extracted.
The preset distance threshold is set according to actual needs, for example within 10 characters.
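The same-block and spacing conditions can be combined in a short sketch. It assumes each anchor text in a block is given as a `(text, start_offset)` pair, where the offset is measured in characters within the block's text; this representation and the offsets in the example are invented for illustration.

```python
def gap(a, b):
    """Characters between two anchors, each given as (text, start_offset)."""
    first, second = sorted((a, b), key=lambda x: x[1])
    return max(0, second[1] - (first[1] + len(first[0])))


def extract_candidates(block_anchors, existing, threshold=10):
    """Return anchors of one block lying within `threshold` characters of a seed.

    A seed is an anchor text that is already an existing entry.
    """
    seeds = [a for a in block_anchors if a[0] in existing]
    return [
        a[0]
        for a in block_anchors
        if a[0] not in existing and any(gap(a, s) < threshold for s in seeds)
    ]


# Offsets are made up: the second anchor is 3 characters after the seed,
# the third is 45 characters after it.
block = [("Because of Love", 0), ("Wang Fei", 18), ("New Drunken Concubine", 60)]
candidates = extract_candidates(block, {"Because of Love"})
```

With the default threshold of 10 characters, only the nearby anchor text is kept; a block with no seed yields no candidates.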
Embodiment Two
Fig. 4 is a flow chart of the entry acquisition method provided by this embodiment. As shown in Fig. 4, the method comprises:
Step S401: obtain a set of existing entries of the same category in an entry base.
Step S402: search with the obtained set of existing entries to obtain anchor texts containing the existing entries, and record the web page position of the anchor text of each existing entry.
Step S403: according to the recorded web page positions, extract at the corresponding positions the anchor texts whose context distance to the anchor text of the existing entry meets a preset requirement.
Steps S401 to S403 are the same as steps S101 to S103 of Embodiment One and are not described again here.
Step S404: compare the extracted anchor texts with the entry base to obtain the unrecorded anchor texts.
Because an extracted anchor text may already be an existing entry, the extracted anchor texts are filtered to remove existing entries. This improves efficiency, since only unrecorded anchor texts are processed subsequently. If "Holding Hands" and "Betrayal Love Song" in Fig. 3 are existing entries, they are filtered out.
An anchor text extracted under one category may belong to another category; in Fig. 3, for example, persons such as "Wang Fei" and "Wang Lin" may be extracted. The extracted anchor texts are therefore compared with the whole entry base, and those already present in the entry base are removed to obtain the unrecorded anchor texts. An unrecorded anchor text that belongs to the person category or another preset related category is also retained, and steps S405 to S406 are performed on it. The preset related categories are categories associated with the current one and are set empirically; for example, the song category is associated with categories such as person, film and entertainment.
It is worth noting that this step may be omitted when processing efficiency is not critical. Alternatively, whether an anchor text is unrecorded may be checked after step S406 has produced its weight or frequency, in order to determine the new entries. In either case, steps S405 to S406 are performed on the extracted anchor texts.
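The comparison against the whole entry base amounts to a membership test across all categories. The sketch below assumes the entry base is held as a mapping from category name to a set of entries; that layout, and the sample contents, are assumptions for illustration only.

```python
def filter_unrecorded(extracted, entry_base):
    """Keep the extracted anchor texts that appear in no category of the entry base."""
    all_entries = set().union(*entry_base.values())
    return [text for text in extracted if text not in all_entries]


# Hypothetical entry base: category name -> set of existing entries.
entry_base = {
    "song": {"Because of Love", "Holding Hands", "Betrayal Love Song"},
    "person": {"Wang Fei"},
}
extracted = ["Holding Hands", "Wang Fei", "Can't Afford to Be Hurt"]
unrecorded = filter_unrecorded(extracted, entry_base)
```

Repeated occurrences are deliberately kept in the output list, because the later frequency count depends on how often an anchor text was extracted.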
Step S405: filter out, from the unrecorded anchor texts, those that do not contain a specified part of speech.
For the anchor texts obtained in step S404, word segmentation and part-of-speech tagging are used to filter out the anchor texts that do not contain a specified part of speech, for example anchor texts containing no verb, noun or adjective.
In addition, to obtain well-formed entries, filtering may also be based on the length of the anchor text and the punctuation marks it contains, so that unsuitable anchor texts are removed.
This step, too, is optional.
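A rough sketch of the three filters (part of speech, length, punctuation) follows. The part-of-speech tagger is passed in as a callable because the text does not name one; `toy_tagger`, its tiny lexicon, the length limits and the use of ASCII punctuation are all hypothetical stand-ins, and a real system would use a proper segmenter and tagger for the target language.

```python
import string

ALLOWED_POS = {"noun", "verb", "adjective"}  # specified parts of speech


def keep_anchor(text, tag_words, min_len=2, max_len=20):
    """Return True if the anchor text passes the length, punctuation and POS checks.

    `tag_words` is any callable mapping a string to (word, pos) pairs.
    """
    if not (min_len <= len(text) <= max_len):
        return False
    if any(ch in string.punctuation for ch in text):  # ASCII punctuation only
        return False
    return any(pos in ALLOWED_POS for _, pos in tag_words(text))


# Toy tagger for demonstration only: whitespace split plus a tiny lexicon.
TOY_LEXICON = {"love": "noun", "story": "noun", "more": "adverb"}


def toy_tagger(text):
    return [(w, TOY_LEXICON.get(w.lower(), "other")) for w in text.split()]


kept = [t for t in ["Love Story", "more...", "more"] if keep_anchor(t, toy_tagger)]
```

Here "more..." is rejected for its punctuation and "more" because it contains no noun, verb or adjective under the toy lexicon.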
Step S406: calculate the weight of each unrecorded anchor text according to its context distance to the anchor text of the existing entry, count the frequency with which the unrecorded anchor text occurs in the current category, and identify an anchor text whose frequency or weight meets a preset requirement as a new entry.
The frequency, that is, the number of occurrences in the current category, is counted for each anchor text remaining after the filtering of step S405, and the weight of each remaining anchor text is calculated. Specifically, calculating the weight of an anchor text according to its context distance to the anchor text of the existing entry comprises:
Step S406_1: within the same page block, determine the context distance between the unrecorded anchor text and the anchor text of the existing entry.
Specifically, the anchor texts of existing entries contained in the page block in which the unrecorded anchor text is located are determined first.
The distance between the unrecorded anchor text and each of the anchor texts of existing entries so obtained is then calculated.
The context distance d may be calculated as, but is not limited to, the length of the character string between the unrecorded anchor text and the existing entry, not counting symbols such as page layout tags, spaces and carriage returns.
Finally, the minimum of these distances is taken as the context distance to the existing entry.
For example, suppose the same page block contains the anchor texts K1, K2, K3, ..., Kn of several existing entries and several unrecorded anchor texts L1, L2, L3 and so on. For each unrecorded anchor text in the page block in turn, the distances to K1 through Kn are calculated, and the minimum of these distances is taken as the context distance between that unrecorded anchor text and the existing entries.
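The minimum-distance rule can be sketched as below. The example assumes a page block has already been flattened into an ordered list of text segments with layout tags removed, where some segments are anchor texts; this segment list and the function names are illustrative assumptions.

```python
def gap_length(segments, i, j):
    """Length of the text between segments i and j, ignoring whitespace."""
    lo, hi = sorted((i, j))
    between = "".join(segments[lo + 1:hi])
    return len("".join(between.split()))  # drops spaces, tabs and line breaks


def context_distance(segments, candidate_idx, seed_idxs):
    """Minimum gap between the candidate and any existing-entry anchor text."""
    return min(gap_length(segments, candidate_idx, k) for k in seed_idxs)


# K1 and K2 stand for anchor texts of existing entries, L1 for an unrecorded one.
segments = ["K1", " - ", "L1", "\n", "some longer text", "K2"]
d = context_distance(segments, candidate_idx=2, seed_idxs=[0, 5])
```

In this example the gap to K1 is one character ("-") and the gap to K2 is fourteen, so the context distance of L1 is 1.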
Step S406_2: using the determined context distance, calculate the weight of the unrecorded anchor text in the corresponding page block.
Using the context distance between the unrecorded anchor text and the existing entry, the weight of the unrecorded anchor text in each page block is calculated. The smaller the context distance, the larger the weight.
The weight may be calculated using, but is not limited to, the following formula:
(Formula 1)
In Fig. 3, for example, the weight of the unrecorded anchor text "Can't Afford to Be Hurt" in that page block is calculated with respect to the anchor text "Because of Love" of the existing entry as follows:
the context distance is d = 6, the intervening string being "2, Wang Lin, -", and the weight is then obtained from Formula 1.
By analogy, the weight of each unrecorded anchor text in the corresponding block is calculated for every recorded page block.
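Formula 1 is not reproduced in this text, so the sketch below does not implement it. It uses 1 / (1 + d) purely as a hypothetical stand-in that satisfies the one property the description states, namely that a smaller context distance gives a larger weight.

```python
def block_weight(d):
    """Hypothetical stand-in for Formula 1: the weight decreases as d grows.

    The expression 1 / (1 + d) is an assumption chosen for illustration;
    it is not the formula of the original document.
    """
    if d < 0:
        raise ValueError("context distance must be non-negative")
    return 1.0 / (1.0 + d)


near = block_weight(2)
far = block_weight(6)
```

Any other monotonically decreasing function of d would serve the same illustrative purpose.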
Step S406_3: over the whole current category, sum the weights calculated for the unrecorded anchor text in each page block from which it was extracted, to obtain the weight of the unrecorded anchor text.
Over the whole current category, the per-block weights of the unrecorded anchor text calculated in step S406_2 are summed, and the sum is used as the weight of the unrecorded anchor text.
For example, summing the weights of "Can't Afford to Be Hurt" calculated in step S406_2 for each page block gives a weight of 295.4, which is then compared with a preset weight threshold.
Counting shows that "Can't Afford to Be Hurt" occurs 1442 times in the song category, and this count is compared with a preset frequency threshold.
If the weight is greater than the preset weight threshold or the frequency is greater than the preset frequency threshold, the anchor text is identified as a new entry. Depending on the needs of the application, it may instead be required that both conditions hold before the anchor text is identified as a new entry.
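The summation and threshold test can be sketched as follows. The function name, the input layout (a list of per-block `(text, weight)` pairs plus a frequency mapping) and the threshold values in the example are assumptions; the `require_both` flag models the option of demanding both conditions.

```python
from collections import defaultdict


def identify_new_entries(block_weights, frequencies, weight_threshold,
                         frequency_threshold, require_both=False):
    """Sum per-block weights per anchor text and apply the two thresholds."""
    totals = defaultdict(float)
    for text, weight in block_weights:
        totals[text] += weight

    new_entries = []
    for text, total in totals.items():
        weight_ok = total > weight_threshold
        freq_ok = frequencies.get(text, 0) > frequency_threshold
        accepted = (weight_ok and freq_ok) if require_both else (weight_ok or freq_ok)
        if accepted:
            new_entries.append(text)
    return new_entries


# Made-up per-block weights and frequencies for two candidate anchor texts.
block_weights = [("A", 0.5), ("A", 0.25), ("B", 0.125)]
frequencies = {"A": 3, "B": 1}
either = identify_new_entries(block_weights, frequencies, 0.5, 5)
both = identify_new_entries(block_weights, frequencies, 0.5, 5, require_both=True)
```

Candidate "A" passes on weight alone (0.75 against a threshold of 0.5) but fails when both conditions are required, since its frequency of 3 does not exceed 5.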
Step S407: determine whether all categories in the entry base have been processed. If so, proceed to step S408 and output the identification results for new entries. Otherwise, return to step S401 to obtain the set of existing entries of the next category in the entry base, until all categories have been processed and the results are output.
The above is a detailed description of the method provided by the invention. The entry acquisition device provided by the invention is described in detail below.
Embodiment Three
Fig. 5 is a schematic diagram of the entry acquisition device provided by this embodiment. As shown in Fig. 5, the device comprises:
An existing entry acquisition module 501, configured to obtain a set of existing entries of the same category in an entry base.
The entry base may be any classified entry base, such as an encyclopedia entry base or an input method entry base. The invention is described using an encyclopedia entry base as the example.
The categories may be the original categories of the classified entry base, including song, film, person, nature, culture, geography, history, life, society, art, economy, science and technology, sports and so on. Alternatively, the existing entries may be divided into categories using an existing classification or clustering method (such as Bayesian classification, decision trees or support vector machines (SVM)).
The set of existing entries of one category is obtained from the entry base, and the existing entries of each category in the entry base are supplied, one category at a time, to the search module 502 and the extraction module 503 for processing.
A search module 502, configured to search with the set of existing entries obtained by the existing entry acquisition module 501, obtain anchor texts containing the existing entries, and record the web page position of the anchor text of each existing entry.
Internet web pages are searched using the obtained set of existing entries to find anchor texts containing existing entries, and those anchor texts are recorded together with their web page positions.
The web page position of an anchor text may comprise: the web page containing the anchor text, the page block containing the anchor text, and the position of the anchor text within the page block. Fig. 2 is a schematic diagram of a web page and the page blocks it contains. As shown in Fig. 2, the web page position of anchor text 1 is the first position in page block A of that web page.
For example, the existing entry acquisition module 501 obtains the existing song category set T1 from the encyclopedia entries. The set T1 contains tens of thousands of existing entries, for example {"Because of Love", "Loving You Hurts Until I Feel No Pain", and so on}. Anchor texts containing existing entries of the song category set T1 are found by searching. For example, a search with the existing entry "Because of Love" finds the anchor text "Because of Love" on the web page http://ting.baidu.com, as shown in Fig. 3, and the page block and web page position of the anchor text "Because of Love" are recorded.
Alternatively, when searching for anchor texts containing the existing entries, all anchor texts of each web page on the Internet may be obtained first and then matched against the set of existing entries of each category, so as to find the anchor texts that match and record the web page, page block and web page position of those anchor texts.
An extraction module 503, configured to extract, according to the web page positions recorded by the search module 502, at the corresponding positions the anchor texts whose context distance to the anchor text of the existing entry meets a preset requirement.
For each recorded web page position of an existing entry's anchor text, the anchor texts whose context distance to that position meets the requirement are extracted as entries.
The context distance meeting the preset requirement may comprise:
the page block containing the extracted anchor text being the same as the page block containing the anchor text of the existing entry. In Fig. 2, for example, anchor text 1 and anchor text 3 are in the same page block, whereas anchor text 1 and anchor text 5 are in different page blocks. If anchor text 1 is the anchor text of an existing entry, the anchor texts that meet the requirement and can be extracted are anchor text 2 and anchor text 3.
Specifically, the page block containing an anchor text may be determined from page layout tags, for example by using layout tags such as "<div></div>" and "<table></table>" to decide whether two anchor texts are in the same page block. Alternatively, the same page block may be determined by visual segmentation of the web page, or the like.
Alternatively, the requirement may be that the page block containing the extracted anchor text is the same as the page block containing the anchor text of the existing entry, and that the spacing between the extracted anchor text and the anchor text of the existing entry is less than a preset distance threshold.
For example, Fig. 3 is a schematic diagram of a page block found by searching with the existing entry "Because of Love". In Fig. 3, anchor texts such as "Wang Fei", "Can't Afford to Be Hurt", "Wang Lin", "The Most Dazzling Folk Style", "Phoenix Legend", "New Drunken Concubine" and "Keeping of Love" are in the same page block as the anchor text "Because of Love" of the existing entry, so those anchor texts are extracted as entries.
To further improve precision, the spacing may also be limited when extracting the anchor texts whose context distance meets the preset requirement. If the spacing between anchor texts such as "New Drunken Concubine" and "Keeping of Love" in Fig. 3 and the anchor text "Because of Love" of the existing entry exceeds the preset distance threshold, those anchor texts are not extracted.
The preset distance threshold is set according to actual needs, for example within 10 characters.
Example IV,
Fig. 6 is the acquisition device schematic diagram for the entry that the present embodiment is provided, as shown in fig. 6, the device includes:
Existing entry acquisition module 601, the existing entry set for obtaining same classification in entry base.
Search module 602, the existing entry set for being obtained using existing entry acquisition module 601 is scanned for, and is obtained
To including the Anchor Text of the existing entry, and record the web placement where the Anchor Text of the existing entry.
Extraction module 603, for the web placement recorded according to search module 602, extracted in corresponding position with it is described
Context distance between the Anchor Text of existing entry meets the Anchor Text of preset requirement.
Above-mentioned module 601 to 603 is corresponding identical with 501 to 503 configuration in embodiment three, is repeated no more in this.
Existing word filtering module 604, for comparing the extracted anchor texts with the entry base to obtain the unrecorded anchor texts, that is, the anchor texts not yet recorded in the entry base.
Because an extracted anchor text may already be an existing entry, the extracted anchor texts are filtered to improve efficiency: existing words are filtered out so that only the unrecorded anchor texts are processed subsequently. If "leading hand" and "betrayal love song" in Fig. 3 are existing entries, they are filtered out.
Anchor texts extracted under one category may belong to another category; for example, persons such as "Wang Fei" and "Wang Lin" can be extracted from Fig. 3. The extracted anchor texts are therefore compared with the whole entry base, and the anchor texts already present in the entry base are removed, leaving the unrecorded anchor texts. An unrecorded anchor text that is an entry under the person category or another preset related category is also retained and passed on to the part-of-speech filtering module 605 and the new entry identification module 606 for further processing. The preset related categories are categories that have an association relationship and are set empirically; for example, the song category is associated with categories such as person, film and entertainment.
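As a minimal sketch of the comparison against the whole entry base, the filtering can be written as an order-preserving set difference. The function name `filter_existing` is hypothetical, and the entry base is assumed to be available as a plain collection of entry strings covering all categories.

```python
def filter_existing(extracted, entry_base):
    """Keep the anchor texts that are absent from the whole entry base,
    preserving first-seen order and dropping repeated texts."""
    known = set(entry_base)  # entries of every category, not only the current one
    seen = set()
    kept = []
    for text in extracted:
        if text not in known and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept
```

Because the comparison uses the whole entry base rather than the current category only, an unrecorded person name found on a song page survives the filter, which matches the retention rule described above.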
It is worth noting that this module may be omitted when processing efficiency is not critical. Alternatively, this module may be applied after the new entry identification module 606 has obtained the weight or frequency of the anchor texts, to identify whether they are unrecorded and thereby determine the new entries. In that case, the part-of-speech filtering module 605 and the new entry identification module 606 operate on the extracted anchor texts.
Part-of-speech filtering module 605, for filtering out, from the unrecorded anchor texts, those that do not contain a specified part of speech.
For the anchor texts obtained by the existing word filtering module 604, word segmentation and part-of-speech tagging are used to filter out the anchor texts that do not contain a specified part of speech, for example those containing no verb, noun or adjective.
In addition, to obtain well-formed entries, filtering may also be based on the length of the anchor text and the punctuation marks it contains, so that anchor texts that do not meet the requirements are filtered out.
Of course, this module is also optional.
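The part-of-speech, length and punctuation filters can be sketched as a single predicate. Everything named here is an assumption for illustration: `tag_fn` stands for any word segmentation and part-of-speech tagging routine that returns `(word, tag)` pairs, the tag names `n`, `v`, `a` and the length limits are arbitrary, and `toy_tagger` is a stand-in so the sketch runs without an external tagger.

```python
import string

REQUIRED_POS = {"n", "v", "a"}  # assumed tag names for noun, verb, adjective


def keep_anchor(text, tag_fn, min_len=2, max_len=20,
                banned=frozenset(string.punctuation)):
    """Return True if the anchor text passes the length, punctuation and
    part-of-speech checks."""
    if not (min_len <= len(text) <= max_len):
        return False
    if any(ch in banned for ch in text):
        return False
    # Keep the text only if at least one word carries a required tag.
    return any(tag in REQUIRED_POS for _, tag in tag_fn(text))


def toy_tagger(text):
    """Stand-in tagger: whitespace split plus a tiny hand-written lexicon."""
    lexicon = {"love": "n", "legend": "n", "click": "v"}
    return [(word, lexicon.get(word, "x")) for word in text.split()]
```

For instance, `keep_anchor("because love", toy_tagger)` passes, while a navigation label such as `"more >>"` is rejected by the punctuation check.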
New entry identification module 606, for calculating the weight of each unrecorded anchor text according to its context distance to the anchor text of the existing entry, counting the frequency with which the unrecorded anchor text occurs in the current category, and identifying the anchor texts whose frequency or weight meets a preset requirement as new entries.
The module counts the frequency, that is, the number of occurrences, in the current category of each anchor text remaining after filtering by the part-of-speech filtering module 605, and calculates the weight of each remaining anchor text. Specifically, calculating the weight of an anchor text according to its context distance to the anchor text of the existing entry involves the following units:
Distance determining unit, for determining, within the same web page block, the context distance between the unrecorded anchor text and the anchor text of the existing entry.
Specifically, the distance determining unit first determines the anchor texts of existing entries contained in the web page block where the unrecorded anchor text is located. It then calculates the distance between the unrecorded anchor text and the anchor text of each existing entry so obtained.
The context distance d may be, but is not limited to being, calculated as the length of the character string lying between the unrecorded anchor text and the existing entry, not counting symbols such as page layout tags, spaces and carriage returns.
Finally, the distance determining unit selects the minimum of these distances as the context distance to the existing entry.
For example, suppose the same web page block contains the anchor texts K1, K2, K3, ..., Kn of several existing entries and several unrecorded anchor texts L1, L2, L3 and so on. For each unrecorded anchor text in the block in turn, the distances to K1 through Kn are calculated, and the minimum of these distances is taken as the context distance between that unrecorded anchor text and the existing entries.
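The minimum-distance rule can be sketched as follows. This is an illustrative sketch only: anchors are assumed to be given as `(start, end)` character offsets into the raw text of one web page block, layout tags are assumed to look like `<...>`, and the names `gap_length` and `context_distance` are hypothetical.

```python
import re

_TAG = re.compile(r"<[^>]*>")   # assumed shape of page layout tags
_WS = re.compile(r"\s+")        # spaces, tabs, carriage returns, newlines


def gap_length(between):
    """Length of the gap string, ignoring layout tags and whitespace."""
    return len(_WS.sub("", _TAG.sub("", between)))


def context_distance(block, candidate_span, existing_spans):
    """Minimum gap between one unrecorded anchor and every existing-entry
    anchor in the same block; None if the block has no existing entry."""
    c_start, c_end = candidate_span
    distances = []
    for e_start, e_end in existing_spans:
        if e_end <= c_start:        # existing entry precedes the candidate
            between = block[e_end:c_start]
        elif c_end <= e_start:      # existing entry follows the candidate
            between = block[c_end:e_start]
        else:                       # overlapping spans: no gap
            between = ""
        distances.append(gap_length(between))
    return min(distances, default=None)
```

With K1 two visible characters before L1 and K2 six visible characters after it, the function returns 2, the smaller of the two gaps.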
Weight calculation unit, for calculating, from the context distance determined by the distance determining unit, the weight of the unrecorded anchor text in the corresponding web page block.
The weight calculation unit uses the context distance between the unrecorded anchor text and the existing entry to calculate the weight of the unrecorded anchor text in each web page block: the smaller the context distance, the larger the weight.
The weight may be, but is not limited to being, calculated with formula 1.
For example, in the web page block of Fig. 3, the weight of the unrecorded anchor text "can not hindering" is calculated relative to the anchor text "because love" of the existing entry as follows:
The context distance is d = 6, the intervening character string being "2, Wang Lin ,-", and the weight is then obtained from this distance.
In the same way, the weight of the unrecorded anchor text within the corresponding block is calculated for every recorded web page block.
Weighting unit, for summing, over the whole current category, the weights of the unrecorded anchor text calculated in each extracted web page block, to obtain the weight of the unrecorded anchor text.
Over the whole current category, the per-block weights of the unrecorded anchor text calculated by the weight calculation unit are summed, and the sum is taken as the weight of the unrecorded anchor text.
For example, summing the weights of "can not hindering" calculated by the weight calculation unit in each web page block gives a weight of 295.4 for "can not hindering", and it is judged whether this weight is greater than the preset weight threshold.
The new entry identification module 606 counts that "can not hindering" occurs 1442 times in the song category, and it is judged whether this count is greater than the preset frequency threshold.
If the weight is greater than the preset weight threshold or the number of occurrences is greater than the preset frequency threshold, the anchor text is identified as a new entry. Depending on the needs of the application, it may instead be required that both conditions are met before the anchor text is identified as a new entry.
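The summation over blocks and the threshold decision can be sketched as below. Note the assumptions: `block_weight` is only a stand-in that decreases with distance and is NOT the patent's formula 1, one observation is assumed per occurrence of an anchor text in a block, and all function and parameter names are hypothetical.

```python
from collections import defaultdict


def block_weight(d):
    """Stand-in weight, NOT formula 1: any function that grows as the
    context distance d shrinks would fit the description."""
    return 1.0 / (1.0 + d)


def identify_new_entries(observations, weight_threshold, freq_threshold,
                         require_both=False):
    """observations: iterable of (anchor_text, context_distance) pairs,
    one pair per occurrence of the anchor text in a web page block."""
    weights = defaultdict(float)
    freqs = defaultdict(int)
    for text, d in observations:
        weights[text] += block_weight(d)  # sum of per-block weights
        freqs[text] += 1                  # number of occurrences
    new_entries = []
    for text in weights:
        w_ok = weights[text] > weight_threshold
        f_ok = freqs[text] > freq_threshold
        passed = (w_ok and f_ok) if require_both else (w_ok or f_ok)
        if passed:
            new_entries.append(text)
    return new_entries
```

The `require_both` flag corresponds to the stricter variant in which the weight and the frequency conditions must hold at the same time.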
Judging module 607, for judging whether all categories in the entry base have been obtained. If so, processing passes to the result output module 608, which outputs the identified new entries; otherwise, processing returns to the existing entry acquisition module 601 to obtain the set of existing entries of the next category in the entry base, until all categories have been taken and the result is output.
The entry acquisition method and device provided by the present invention mine entity entries with an existing dictionary and supply new entries that have not yet been created, which can guide users to create the knowledge corresponding to those new entries. This solves the problem of insufficient coverage of entity entries in an encyclopedia database, helps complete the framed data (entity entry, attribute name, attribute value), and facilitates more effective knowledge search.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (18)
1. A method for acquiring entries, characterized by comprising:
S1. obtaining a set of existing entries of the same category in an entry base;
S2. searching with the obtained set of existing entries to obtain anchor texts containing the existing entries, and recording the web page positions of the anchor texts of the existing entries;
S3. according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance to the anchor text of an existing entry meets a preset requirement.
2. The method according to claim 1, characterized in that, after step S3, the method further comprises:
S4. calculating the weight of each extracted anchor text according to its context distance to the anchor text of the existing entry, counting the frequency with which the extracted anchor text occurs in the current category, and identifying the anchor texts whose frequency or weight meets a preset requirement as new entries.
3. The method according to claim 1 or 2, characterized in that the web page position of an anchor text comprises:
the web page containing the anchor text, the web page block containing the anchor text, and the position of the anchor text within the web page block.
4. The method according to claim 3, characterized in that the context distance meeting the preset requirement comprises:
the web page block containing the extracted anchor text being the same as the web page block containing the anchor text of the existing entry.
5. The method according to claim 4, characterized in that the context distance meeting the preset requirement further comprises:
the interval distance between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
6. The method according to claim 2, characterized in that calculating the weight of each extracted anchor text according to its context distance to the anchor text of the existing entry specifically comprises:
determining, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
calculating, from the determined context distance, the weight of the extracted anchor text in the corresponding web page block;
summing, over the whole current category, the weights of the extracted anchor text calculated in each extracted web page block, to obtain the weight of the extracted anchor text.
7. The method according to claim 6, characterized in that determining, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry specifically comprises:
determining the anchor texts of existing entries contained in the web page block where the extracted anchor text is located;
calculating the distance between the extracted anchor text and the anchor text of each existing entry so obtained;
selecting the minimum of the distances as the context distance to the existing entry.
8. The method according to claim 6, characterized in that, after step S3, the method further comprises:
comparing the extracted anchor texts with the entry base to obtain the unrecorded anchor texts;
performing step S4 only on the unrecorded anchor texts.
9. The method according to claim 2, characterized in that, after step S3, the method further comprises:
filtering out, from the extracted anchor texts, the anchor texts that do not contain a specified part of speech;
performing step S4 only on the anchor texts remaining after the filtering.
10. A device for acquiring entries, characterized by comprising:
an existing entry acquisition module, for obtaining a set of existing entries of the same category in an entry base;
a search module, for searching with the set of existing entries obtained by the existing entry acquisition module to obtain anchor texts containing the existing entries, and for recording the web page positions of the anchor texts of the existing entries;
an extraction module, for extracting, at the corresponding positions and according to the web page positions recorded by the search module, the anchor texts whose context distance to the anchor text of an existing entry meets a preset requirement.
11. The device according to claim 10, characterized in that the device further comprises:
a new entry identification module, for calculating the weight of each anchor text extracted by the extraction module according to its context distance to the anchor text of the existing entry, counting the frequency with which the extracted anchor text occurs in the current category, and identifying the anchor texts whose frequency or weight meets a preset requirement as new entries.
12. The device according to claim 10 or 11, characterized in that the web page position of an anchor text comprises:
the web page containing the anchor text, the web page block containing the anchor text, and the position of the anchor text within the web page block.
13. The device according to claim 12, characterized in that the context distance meeting the preset requirement comprises:
the web page block containing the extracted anchor text being the same as the web page block containing the anchor text of the existing entry.
14. The device according to claim 13, characterized in that the context distance meeting the preset requirement further comprises:
the interval distance between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
15. The device according to claim 11, characterized in that the new entry identification module comprises:
a distance determining unit, for determining, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
a weight calculation unit, for calculating, from the context distance determined by the distance determining unit, the weight of the extracted anchor text in the corresponding web page block;
a weighting unit, for summing, over the whole current category, the weights of the extracted anchor text calculated in each extracted web page block, to obtain the weight of the extracted anchor text.
16. The device according to claim 15, characterized in that the distance determining unit is specifically configured to:
determine the anchor texts of existing entries contained in the web page block where the extracted anchor text is located;
calculate the distance between the extracted anchor text and the anchor text of each existing entry so obtained;
select the minimum of the distances as the context distance to the existing entry.
17. The device according to claim 15, characterized in that the device further comprises:
an existing word filtering module, for comparing the anchor texts extracted by the extraction module with the entry base to obtain the unrecorded anchor texts,
and for supplying the unrecorded anchor texts to the new entry identification module.
18. The device according to claim 11, characterized in that the device further comprises:
a part-of-speech filtering module, for filtering out, from the anchor texts extracted by the extraction module, the anchor texts that do not contain a specified part of speech,
and for supplying the anchor texts remaining after the filtering to the new entry identification module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210151282.6A CN103425660B (en) | 2012-05-15 | 2012-05-15 | The acquisition methods and device of a kind of entry |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103425660A CN103425660A (en) | 2013-12-04 |
CN103425660B true CN103425660B (en) | 2017-10-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||