CN103425660B - Method and device for acquiring entries - Google Patents


Info

Publication number
CN103425660B
Authority
CN
China
Prior art keywords
anchor text
extracted
entry
existing entry
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210151282.6A
Other languages
Chinese (zh)
Other versions
CN103425660A (en)
Inventor
李永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210151282.6A
Publication of CN103425660A
Application granted
Publication of CN103425660B
Legal status: Active
Anticipated expiration


Abstract

The invention provides a method and a device for acquiring entries. The method includes: obtaining a set of existing entries of the same category from an entry database; searching with the obtained set of existing entries to obtain anchor texts that contain the existing entries, and recording the web page position of each such anchor text; and, according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance from the anchor text of an existing entry meets a preset requirement. The method and device mine entity entries using an existing dictionary and can guide users in creating new entries, which addresses the insufficient coverage of entity entries in encyclopedia databases and facilitates more effective knowledge search.

Description

Method and device for acquiring entries
【Technical field】
The present invention relates to the field of Internet information processing, and in particular to a method and a device for acquiring entries.
【Background】
With the continuous development of communication technology and networks, people increasingly use the Internet to search for knowledge and information. An encyclopedia website is a platform on which all Internet users can equally browse, create and improve content, for example Baidu Baike, Wikipedia and Hudong Baike. Through an encyclopedia website, Internet users can find the comprehensive, accurate and objective definitional information they want, and other users can query and browse similar topics to obtain the corresponding knowledge or reference.
An entry is the basic unit of content on an encyclopedia website. An entry has one or more separate topics and describes a thing, a person, or a combination of knowledge content on a particular topic, for example "the Forbidden City", "Liu Dehua (Andy Lau)" or "2008 Beijing Olympic Games". An encyclopedia website contains a very large number of entries covering many industries, topics and fields of knowledge. For a search engine, using these encyclopedia entries can greatly improve retrieval accuracy and coverage, and helps extract structured data from web pages for vertical search, yielding more accurate information.
As information spreads widely and the content people exchange keeps expanding, new entries emerge constantly. At present, new entries are added manually: a person creates the entry and its knowledge content, and entries that pass manual review are added to the encyclopedia website so that users can search for the knowledge and information. For an entry that has not yet been created, such as a new song, film or person, the system cannot actively discover it on the Internet. As a result, some new entries are not created and updated in time, which affects the retrieval speed of the search engine and can even affect the precision and recall of retrieval.
【Summary of the invention】
In view of this, the invention provides a method and a device for acquiring entries, which mine entity entries using an existing dictionary and can guide users in creating new entries. This addresses the insufficient coverage of entity entries in encyclopedia databases and facilitates more effective knowledge search.
The specific technical solution is as follows:
A method for acquiring entries, comprising the following steps:
S1, obtaining a set of existing entries of the same category from an entry database;
S2, searching with the obtained set of existing entries to obtain anchor texts that contain the existing entries, and recording the web page position of the anchor text of each existing entry;
S3, according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.
According to a preferred embodiment of the present invention, after step S3 the method further comprises:
S4, calculating the weight of each extracted anchor text according to its context distance from the anchor text of the existing entry, counting the frequency with which each extracted anchor text occurs in the current category, and identifying an anchor text whose frequency or weight meets a preset requirement as a new entry.
According to a preferred embodiment of the present invention, the web page position of an anchor text includes:
the web page containing the anchor text, the web page block containing the anchor text, and the position of the anchor text within the web page block.
According to a preferred embodiment of the present invention, the context distance meeting the preset requirement includes:
the web page block containing the extracted anchor text being the same as the web page block containing the anchor text of the existing entry.
According to a preferred embodiment of the present invention, the context distance meeting the preset requirement further includes:
the spacing distance between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
According to a preferred embodiment of the present invention, calculating the weight of the extracted anchor text according to its context distance from the anchor text of the existing entry specifically includes:
determining, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
using the determined context distance to calculate the weight of the extracted anchor text within the corresponding web page block;
summing, over the whole current category, the weights of the extracted anchor text calculated in each web page block from which it was extracted, to obtain the weight of the extracted anchor text.
According to a preferred embodiment of the present invention, determining, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry specifically includes:
determining the anchor texts of existing entries contained in the web page block where the extracted anchor text is located;
calculating the distance between the extracted anchor text and the anchor text of each existing entry so obtained;
selecting the minimum of these distances as the context distance from the existing entry.
According to a preferred embodiment of the present invention, after step S3 the method further comprises:
comparing the extracted anchor texts with the entry database to obtain the unrecorded anchor texts, that is, those not yet included in the entry database;
performing step S4 only on the unrecorded anchor texts.
According to a preferred embodiment of the present invention, after step S3 the method further comprises:
filtering out, from the extracted anchor texts, those that do not contain a specified part of speech;
performing step S4 only on the anchor texts remaining after filtering.
A device for acquiring entries, comprising:
an existing entry acquisition module, configured to obtain a set of existing entries of the same category from an entry database;
a search module, configured to search with the set of existing entries obtained by the existing entry acquisition module, obtain anchor texts that contain the existing entries, and record the web page position of the anchor text of each existing entry;
an extraction module, configured to extract, at the corresponding positions and according to the web page positions recorded by the search module, the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.
According to a preferred embodiment of the present invention, the device further comprises:
a new entry identification module, configured to calculate the weight of each anchor text extracted by the extraction module according to its context distance from the anchor text of the existing entry, count the frequency with which each extracted anchor text occurs in the current category, and identify an anchor text whose frequency or weight meets a preset requirement as a new entry.
According to a preferred embodiment of the present invention, the web page position of an anchor text includes:
the web page containing the anchor text, the web page block containing the anchor text, and the position of the anchor text within the web page block.
According to a preferred embodiment of the present invention, the context distance meeting the preset requirement includes:
the web page block containing the extracted anchor text being the same as the web page block containing the anchor text of the existing entry.
According to a preferred embodiment of the present invention, the context distance meeting the preset requirement further includes:
the spacing distance between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
According to a preferred embodiment of the present invention, the new entry identification module comprises:
a distance determining unit, configured to determine, within the same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
a weight calculation unit, configured to use the context distance determined by the distance determining unit to calculate the weight of the extracted anchor text within the corresponding web page block;
a weighting unit, configured to sum, over the whole current category, the weights of the extracted anchor text calculated in each web page block from which it was extracted, to obtain the weight of the extracted anchor text.
According to a preferred embodiment of the present invention, the distance determining unit is specifically configured to:
determine the anchor texts of existing entries contained in the web page block where the extracted anchor text is located;
calculate the distance between the extracted anchor text and the anchor text of each existing entry so obtained;
select the minimum of these distances as the context distance from the existing entry.
According to a preferred embodiment of the present invention, the device further comprises:
an existing entry filtering module, configured to compare the anchor texts extracted by the extraction module with the entry database to obtain the unrecorded anchor texts,
and to supply the unrecorded anchor texts to the new entry identification module.
According to a preferred embodiment of the present invention, the device further comprises:
a part-of-speech filtering module, configured to filter out, from the anchor texts extracted by the extraction module, those that do not contain a specified part of speech,
and to supply the anchor texts remaining after filtering to the new entry identification module.
As can be seen from the above technical solutions, the method and device for acquiring entries provided by the present invention mine entity entries using an existing dictionary and supply new entries that have not yet been created. They can guide users in creating the knowledge content for new entries, address the insufficient coverage of entity entries in encyclopedia databases, help improve the data of the encyclopedia, and facilitate more effective knowledge search.
【Brief description of the drawings】
Fig. 1 is a flow chart of the entry acquisition method provided by embodiment one of the present invention;
Fig. 2 is a schematic diagram of a web page and the web page blocks it contains;
Fig. 3 is a schematic diagram of a web page block found by searching with the existing entry "Because of Love";
Fig. 4 is a flow chart of the entry acquisition method provided by embodiment two of the present invention;
Fig. 5 is a schematic diagram of the entry acquisition device provided by embodiment three of the present invention;
Fig. 6 is a schematic diagram of the entry acquisition device provided by embodiment four of the present invention.
【Detailed description of the embodiments】
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment one
Fig. 1 is a flow chart of the entry acquisition method provided by this embodiment. As shown in Fig. 1, the method includes:
Step S101, obtaining a set of existing entries of the same category from an entry database.
The entry database can be a categorized entry database such as an encyclopedia entry database or an input method entry database. The present invention is illustrated using an encyclopedia entry database as an example.
The categories can be the original categories of the categorized entry database, including song, film, person, nature, culture, geography, history, life, society, art, economy, science and technology, sports and so on. Alternatively, the existing entries can be divided into categories using an existing classification or clustering method, such as Bayesian classification, decision trees or support vector machines (SVM).
The set of existing entries of the same category is obtained from the entry database, and steps S102 and S103 are performed for the existing entries of each category in the entry database, one category at a time.
Step S102, searching with the obtained set of existing entries to obtain anchor texts that contain the existing entries, and recording the web page position of the anchor text of each existing entry.
Internet web pages are searched using the obtained set of existing entries to find anchor texts that contain an existing entry, and those anchor texts are recorded together with their web page positions.
The web page position of an anchor text can include the web page containing the anchor text, the web page block containing the anchor text, and the position of the anchor text within the web page block. Fig. 2 is a schematic diagram of a web page and the web page blocks it contains. As shown in Fig. 2, the web page position of anchor text 1 is the first position in web page block A of that web page.
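The three-part web page position described above can be sketched as a small record type. This is only an illustrative sketch: the names `AnchorPosition`, `AnchorRecord`, `page_url`, `block_id` and `index_in_block`, as well as the example URL, are assumptions introduced here and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorPosition:
    """Web page position of one anchor text (field names are hypothetical)."""
    page_url: str        # web page containing the anchor text
    block_id: str        # web page block containing the anchor text
    index_in_block: int  # 0-based position of the anchor text within the block


@dataclass(frozen=True)
class AnchorRecord:
    """An anchor text together with its recorded web page position."""
    text: str
    position: AnchorPosition


# Anchor text 1 of Fig. 2: first position in web page block A of the page.
record = AnchorRecord("Anchor text 1",
                      AnchorPosition("http://example.com/page", "A", 0))
```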
For example, step S101 obtains the existing song category set T1 from the encyclopedia entries. The song category set T1 contains tens of thousands of existing entries, for example {"Because of Love", "Loving You Hurts Until I Feel No Pain", ...}. A search finds the anchor texts that contain existing entries in the song category set T1. For example, searching with the existing entry "Because of Love" finds the anchor text "Because of Love" on the web page http://ting.baidu.com, as shown in Fig. 3, and the web page block and web page position of the anchor text "Because of Love" are recorded.
Alternatively, when searching for anchor texts that contain the existing entries, all anchor texts of every web page on the Internet can be obtained first and then matched against the set of existing entries of each category. The anchor texts that match are found, and the web page, web page block and web page position of those anchor texts are recorded.
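The alternative just described (collect all anchor texts first, then match them against the existing entry set) might look like the following minimal sketch. It uses Python's standard `html.parser`; the class `AnchorCollector`, the function `match_existing` and the choice of exact string matching are assumptions made for illustration, and a substring test could be used instead where "contains the existing entry" is meant literally.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the text of every <a> element in document order."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._buf = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._buf = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append("".join(self._buf).strip())
            self._in_anchor = False


def match_existing(html, existing_entries):
    """Return (index, anchor_text) for anchors that equal an existing entry."""
    collector = AnchorCollector()
    collector.feed(html)
    return [(i, text) for i, text in enumerate(collector.anchors)
            if text in existing_entries]


page = '<div><a href="#">Because of Love</a><a href="#">Unknown Song</a></div>'
matches = match_existing(page, {"Because of Love"})
```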
Step S103, according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.
For each recorded web page position of an anchor text of an existing entry, the anchor texts whose context distance from that position meets the requirement are extracted as entries.
The context distance meeting the preset requirement can include the following:
The web page block containing the extracted anchor text is the same as the web page block containing the anchor text of the existing entry. In Fig. 2, anchor text 1 and anchor text 3 are in the same web page block, whereas anchor text 1 and anchor text 5 are in different web page blocks. If anchor text 1 is the anchor text of an existing entry, the anchor texts that meet the requirement and can be extracted are anchor text 2 and anchor text 3.
Specifically, the web page block containing an anchor text can be determined from page layout tags, such as "<div></div>" and "<table></table>", to decide whether two anchor texts are in the same web page block. Alternatively, the same web page block can be determined by visual segmentation of the web page or similar techniques.
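One way to picture the layout-tag approach is the sketch below, which treats the innermost enclosing `<div>` or `<table>` as the web page block of an anchor. This is an assumption made for illustration: the patent does not say how nested tags are handled, and the names `BlockAnchorParser` and `same_block_anchors` are hypothetical. The sketch also assumes well-formed HTML.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # layout tags named in the text


class BlockAnchorParser(HTMLParser):
    """Records each anchor text with the id of its innermost block element."""

    def __init__(self):
        super().__init__()
        self._stack = []       # ids of currently open block elements
        self._next_id = 0
        self._in_anchor = False
        self._buf = []
        self.anchors = []      # list of (block_id, anchor_text)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append(self._next_id)
            self._next_id += 1
        elif tag == "a":
            self._in_anchor = True
            self._buf = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            self._stack.pop()
        elif tag == "a" and self._in_anchor:
            block = self._stack[-1] if self._stack else None
            self.anchors.append((block, "".join(self._buf).strip()))
            self._in_anchor = False


def same_block_anchors(html, existing_entry):
    """Anchor texts sharing a block with the existing entry's anchor text."""
    parser = BlockAnchorParser()
    parser.feed(html)
    blocks = {b for b, t in parser.anchors if t == existing_entry}
    return [t for b, t in parser.anchors
            if b in blocks and t != existing_entry]


page = ('<div><a>Anchor 1</a><a>Anchor 2</a><a>Anchor 3</a></div>'
        '<div><a>Anchor 5</a></div>')
extracted = same_block_anchors(page, "Anchor 1")
```

The example page mirrors Fig. 2: with anchor 1 as the existing entry, anchors 2 and 3 are extracted and anchor 5 is not.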
Alternatively, the requirement can be that the web page block containing the extracted anchor text is the same as the web page block containing the anchor text of the existing entry, and that the spacing distance between the extracted anchor text and the anchor text of the existing entry is less than a preset distance threshold.
For example, Fig. 3 is a schematic diagram of a web page block found by searching with the existing entry "Because of Love". In Fig. 3, anchor texts such as "Wang Fei", "Can't Afford to Be Hurt", "Wang Lin", "The Most Dazzling Folk Style", "Phoenix Legend", "New Drunken Concubine" and "Offering of Love" are in the same web page block as the anchor text "Because of Love" of the existing entry, so those anchor texts are extracted as entries.
To further improve precision, the spacing distance can also be limited when extracting anchor texts whose context distance meets the preset requirement. If the spacing distance between anchor texts such as "New Drunken Concubine" or "Offering of Love" in Fig. 3 and the anchor text "Because of Love" of the existing entry exceeds the preset distance threshold, those anchor texts are not extracted.
The preset distance threshold is set according to actual needs, for example within 10 characters.
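The spacing-distance limit can be illustrated with a short sketch that measures the gap between two anchor texts by character offsets in the plain text of a block. The `Span` type, the offsets in the example and the function names are assumptions for illustration only; the threshold of 10 characters is the example value given above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    start: int  # offset of the first character in the block's plain text
    end: int    # offset one past the last character


def gap(a, b):
    """Number of characters separating two non-overlapping spans."""
    if a.start > b.start:
        a, b = b, a
    return max(0, b.start - a.end)


def extract_nearby(existing, others, threshold=10):
    """Keep anchor texts whose gap from the existing entry is below threshold."""
    return [s.text for s in others if gap(existing, s) < threshold]


existing = Span("Because of Love", 0, 15)
others = [Span("Wang Fei", 18, 26), Span("New Drunken Concubine", 60, 81)]
nearby = extract_nearby(existing, others)
```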
Embodiment two
Fig. 4 is a flow chart of the entry acquisition method provided by this embodiment. As shown in Fig. 4, the method includes:
Step S401, obtaining a set of existing entries of the same category from an entry database.
Step S402, searching with the obtained set of existing entries to obtain anchor texts that contain the existing entries, and recording the web page position of the anchor text of each existing entry.
Step S403, according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.
Steps S401 to S403 are the same as steps S101 to S103 in embodiment one and are not described again here.
Step S404, comparing the extracted anchor texts with the entry database to obtain the unrecorded anchor texts.
Because an extracted anchor text may already be an existing entry, the extracted anchor texts are filtered to remove existing entries, which improves efficiency: only the unrecorded anchor texts are processed afterwards. If "Holding Hands" and "Betrayal Love Song" in Fig. 3 are existing entries, they are filtered out.
An anchor text extracted under one category may belong to another category. For example, persons such as "Wang Fei" and "Wang Lin" can be extracted from Fig. 3. The extracted anchor texts are therefore compared with the whole entry database, and those already present in the entry database are removed, leaving the unrecorded anchor texts. If an unrecorded anchor text belongs to the person category or another preset related category, it is also retained and steps S405 to S406 are performed on it. A preset related category is a category that has an association with the current category and is set empirically. For example, the song category is associated with categories such as person, film and entertainment.
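The comparison against the whole entry database amounts to a membership test. A minimal sketch follows, assuming the entry database can be represented as an in-memory set of entry names; the function name `filter_unrecorded` is hypothetical, and the sketch additionally removes duplicates, which the text does not require.

```python
def filter_unrecorded(extracted, entry_database):
    """Return extracted anchor texts absent from the entry database.

    Order of first occurrence is preserved and duplicates are dropped.
    """
    seen = set()
    unrecorded = []
    for text in extracted:
        if text not in entry_database and text not in seen:
            seen.add(text)
            unrecorded.append(text)
    return unrecorded


entry_database = {"Because of Love", "Holding Hands", "Betrayal Love Song"}
extracted = ["Holding Hands", "Can't Afford to Be Hurt",
             "Betrayal Love Song", "Can't Afford to Be Hurt"]
unrecorded = filter_unrecorded(extracted, entry_database)
```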
It is worth noting that when processing efficiency is not critical, this step can be skipped. Alternatively, the check for whether an anchor text is unrecorded can be made after step S406 has produced the weight or frequency of the anchor text, in order to determine the new entries. In that case, steps S405 to S406 below are performed on all extracted anchor texts.
Step S405, filtering out, from the unrecorded anchor texts, those that do not contain a specified part of speech.
For the anchor texts obtained in step S404, word segmentation and part-of-speech tagging are used to filter out the anchor texts that do not contain a specified part of speech, for example anchor texts that contain no verb, noun or adjective.
In addition, to obtain well-formed entries, the anchor texts can also be filtered by length and by the punctuation marks they contain, removing those that do not meet the requirements.
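The length and punctuation filter could be sketched as below. The length limits and the set of disallowed characters are assumptions chosen for illustration, since the patent does not specify them. Part-of-speech filtering is not shown because it depends on an external word segmenter and tagger.

```python
# Hypothetical set of characters treated as disqualifying punctuation.
FORBIDDEN = set('<>[]{}|/\\!?;:,' + '。，！？；：、《》')


def is_well_formed(text, min_len=2, max_len=20):
    """True if the anchor text has an acceptable length and no forbidden marks."""
    stripped = text.strip()
    if not (min_len <= len(stripped) <= max_len):
        return False
    return not any(ch in FORBIDDEN for ch in stripped)


def filter_anchor_texts(texts, min_len=2, max_len=20):
    """Keep only the anchor texts that pass the length and punctuation checks."""
    return [t for t in texts if is_well_formed(t, min_len, max_len)]


kept = filter_anchor_texts(["Because of Love", "More>>", "x", "Wang Fei"])
```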
This step is likewise optional.
Step S406, calculating the weight of each unrecorded anchor text according to its context distance from the anchor text of the existing entry, counting the frequency with which each unrecorded anchor text occurs in the current category, and identifying an anchor text whose frequency or weight meets a preset requirement as a new entry.
The frequency, that is, the number of occurrences in the current category, of each anchor text remaining after the filtering of step S405 is counted, and the weight of each remaining anchor text is calculated. Specifically, calculating the weight of an anchor text according to its context distance from the anchor text of the existing entry includes:
Step S406_1, determining, within the same web page block, the context distance between the unrecorded anchor text and the anchor text of the existing entry.
Specifically, the anchor texts of existing entries contained in the web page block where the unrecorded anchor text is located are determined first.
The distance between the unrecorded anchor text and the anchor text of each existing entry so obtained is then calculated.
The context distance d can be, but is not limited to being, calculated as the length of the character string that separates the unrecorded anchor text from the existing entry, excluding symbols such as page layout tags, spaces and carriage returns.
Finally, the minimum of these distances is selected as the context distance from the existing entry.
For example, suppose the same web page block contains the anchor texts K1, K2, K3, ..., Kn of several existing entries and several unrecorded anchor texts L1, L2, L3 and so on. For each unrecorded anchor text in the web page block, the distances to K1 through Kn are calculated, and the minimum is taken as the context distance between that unrecorded anchor text and the existing entries.
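The minimum-distance rule can be sketched as follows, assuming a web page block has already been reduced to an ordered list of `(kind, text)` tokens with layout tags stripped. That token representation and the function names are assumptions for illustration. Whitespace is excluded from the count, as the text specifies for spaces and carriage returns.

```python
def context_distance(tokens, i, j):
    """Count non-whitespace characters between tokens i and j of a block."""
    lo, hi = sorted((i, j))
    between = "".join(text for _, text in tokens[lo + 1:hi])
    return sum(1 for ch in between if not ch.isspace())


def min_context_distance(tokens, candidate_idx, existing_entries):
    """Smallest distance from the candidate to any existing-entry anchor."""
    distances = [
        context_distance(tokens, candidate_idx, k)
        for k, (kind, text) in enumerate(tokens)
        if kind == "anchor" and text in existing_entries and k != candidate_idx
    ]
    return min(distances) if distances else None


# One block with existing entries K1 and K2 and an unrecorded anchor L1.
block = [("anchor", "K1"), ("text", " - "), ("anchor", "L1"),
         ("text", " 12345 "), ("anchor", "K2")]
d = min_context_distance(block, 2, {"K1", "K2"})
```

In this example the distance from L1 to K1 is 1 and to K2 is 5, so the context distance of L1 is 1.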
Step S406_2, using the determined context distance to calculate the weight of the unrecorded anchor text within the corresponding web page block.
Using the context distance between the unrecorded anchor text and the existing entry, the weight of the unrecorded anchor text in each web page block is calculated. The smaller the context distance, the greater the weight.
The weight can be, but is not limited to being, calculated with:
(Formula 1)
For example, in the web page block of Fig. 3, the weight of the unrecorded anchor text "Can't Afford to Be Hurt" is calculated with respect to the anchor text "Because of Love" of the existing entry as follows:
The context distance is d = 6, the intervening character string being "2, Wang Lin, -"; the weight then follows from Formula 1.
In the same way, the weight of each unrecorded anchor text is calculated in each recorded web page block.
Step S406_3, summing, over the whole current category, the weights of the unrecorded anchor text calculated in each web page block from which it was extracted, to obtain the weight of the unrecorded anchor text.
Over the whole current category, the per-block weights of the unrecorded anchor text calculated in step S406_2 are summed, and the sum is taken as the weight of that unrecorded anchor text.
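The per-block weighting and the summation over a category can be sketched as below. Note that Formula 1 is not reproduced in this text, so `block_weight` uses a stand-in, 1 / (1 + d), chosen only because it decreases as the context distance grows; it is an assumption and not the patent's formula. The function names and the `(anchor_text, distance)` input format are also hypothetical.

```python
from collections import defaultdict


def block_weight(d):
    """Weight of an anchor text within one block.

    ASSUMPTION: placeholder for Formula 1, which is not available here.
    Any function that decreases with d fits the description in the text.
    """
    return 1.0 / (1.0 + d)


def total_weights(observations):
    """Sum per-block weights over a category.

    observations: iterable of (anchor_text, context_distance), one item per
    web page block in which the anchor text was extracted.
    """
    totals = defaultdict(float)
    for text, d in observations:
        totals[text] += block_weight(d)
    return dict(totals)


weights = total_weights([("X", 0), ("X", 1), ("Y", 3)])
```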
For example, summing the weights of "Can't Afford to Be Hurt" calculated for each web page block in step S406_2 gives a total weight of 295.4, which is compared with a preset weight threshold.
Counting shows that "Can't Afford to Be Hurt" occurs 1,442 times in the song category, and this frequency is compared with a preset frequency threshold.
If the weight exceeds the preset weight threshold or the frequency exceeds the preset frequency threshold, the anchor text is identified as a new entry. Depending on the needs of the application, it can instead be required that both conditions hold before the anchor text is identified as a new entry.
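The identification rule can be written as a small predicate. The function name, the `require_both` switch and the threshold values used in the example are assumptions for illustration; only the weight 295.4 and the frequency 1,442 come from the text.

```python
def is_new_entry(weight, frequency, weight_threshold, frequency_threshold,
                 require_both=False):
    """Decide whether an anchor text is identified as a new entry.

    By default either condition suffices; with require_both=True the weight
    and the frequency must both exceed their thresholds.
    """
    by_weight = weight > weight_threshold
    by_frequency = frequency > frequency_threshold
    if require_both:
        return by_weight and by_frequency
    return by_weight or by_frequency


# Values from the example in the text, with hypothetical thresholds.
either = is_new_entry(295.4, 1442, weight_threshold=100.0,
                      frequency_threshold=2000)
both = is_new_entry(295.4, 1442, weight_threshold=100.0,
                    frequency_threshold=2000, require_both=True)
```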
Step S407, judging whether all categories in the entry database have been processed. If so, the method proceeds to step S408 and outputs the recognition results for new entries. Otherwise, it returns to step S401 and obtains the set of existing entries of the next category in the entry database, until all categories have been processed and the results are output.
The above is a detailed description of the method provided by the present invention. The entry acquisition device provided by the present invention is described in detail below.
Embodiment three
Fig. 5 is the acquisition device schematic diagram for the entry that the present embodiment is provided.As shown in figure 5, the device includes:
Existing entry acquisition module 501, the existing entry set for obtaining same classification in entry base.
The entry base can be the classification entry base such as encyclopaedia entry base, input method entry base, in the present invention with encyclopaedia Illustrated exemplified by entry base.
The classification can use classification each original classification of entry base, including:Song, film, personage, nature, text The classification such as change, geography, history, life, society, art, economy, science and technology, physical culture, or, it can utilize existing to existing entry Classification or clustering method (such as bayes classification method, traditional decision-tree, support vector machines) divide classification.
The existing entry set of same classification in entry base is obtained, one by one the existing entry that each in entry base is classified is carried Supply search module 502 and extraction module 503 are performed.
Search module 502, the existing entry set for being obtained using existing entry acquisition module 501 is scanned for, and is obtained To including the Anchor Text of the existing entry, and record the web placement where the Anchor Text of the existing entry.
In internet web page, scanned for using the existing entry set of acquisition, obtain the anchor text comprising existing entry This, records the web placement where those Anchor Texts and Anchor Text.
Web placement where Anchor Text can include:The web page release where webpage, Anchor Text where Anchor Text with And position of the Anchor Text in web page release.Fig. 2 be a webpage and its comprising web page release schematic diagram, as shown in Fig. 2 anchor Web placement where text 1 is first position in the web page release A of the webpage.
For example, existing categorizing songs set T1 in encyclopaedia entry is got by existing entry acquisition module 501, Categorizing songs set T1 includes tens of thousands of existing entries, for example, { because love, like that you arrive bitterly and do not know pain, etc.. }.It is logical Cross search and find the Anchor Text for including existing entry in categorizing songs set T1, for example, being entered using existing entry " because love " Row search, in http:Anchor Text " because love " is found in //ting.baidu.com webpages, as shown in figure 3, recording anchor text Web page release and web placement where this " because love ".
Or, when scanning for the Anchor Text comprising the existing entry, it can also first obtain each net on internet All Anchor Texts of page, recycle the existing entry set of each classification to be matched, and find out the Anchor Text that can be matched, and record should Webpage, web page release and web placement where a little Anchor Texts.
Extraction module 503, for the web placement recorded according to search module 502, extracted in corresponding position with it is described Context distance between the Anchor Text of existing entry meets the Anchor Text of preset requirement.
For the web placement of the Anchor Text of existing entry recorded, extract and met with web placement context distance It is required that Anchor Text be used as entry.
Wherein, the context distance meets preset requirement and can included:
Web page release where the Anchor Text extracted is identical with the web page release where the Anchor Text of existing entry.As schemed Anchor Text 1 in 2 is identical with the web page release where Anchor Text 3, but Anchor Text 1 and Anchor Text 5 are then in different webpage point In block.If Anchor Text 1 is the Anchor Text of existing entry, the Anchor Text that can extract satisfaction requirement is:Anchor Text 2 and anchor Text 3.
Web page release that specifically, can be according to where page layout label determines Anchor Text, such as page layout label "< div></div>" and "<table></table>" etc. judged, it is determined whether in identical web page release.Or, also may be used To determine same web page release according to webpage visual piecemeal etc..
Alternatively, the web page block where the extracted anchor text is located is the same as the web page block where the anchor text of the existing entry is located, and the spacing distance between the extracted anchor text and the anchor text of the existing entry is less than a preset distance threshold.
For example, Fig. 3 is a schematic diagram of a web page block found by searching with the existing entry "because love". In Fig. 3, anchor texts such as "Wang Fei", "can not hindering", "Wang Lin", "most dazzling national wind", "phoenix legend", "new Drunken Concubine", and "keeping of love" are in the same web page block as the anchor text "because love" of the existing entry, so those anchor texts are extracted as entries.
To further improve precision, the spacing distance may also be limited when extracting the anchor texts whose context distance meets the preset requirement. For example, if the spacing distance between anchor texts such as "new Drunken Concubine" and "keeping of love" in Fig. 3 and the anchor text "because love" of the existing entry exceeds the preset distance threshold, those anchor texts are not extracted.
The preset distance threshold is set according to actual needs, for example within 10 characters.
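The two variants of the extraction requirement (same block only, or same block plus a spacing limit) can be sketched roughly as below. The Anchor record, its offset fields, and the function names are assumptions made for illustration; the patent does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    text: str
    block_id: int   # web page block the anchor text sits in
    start: int      # offset of the first character within the block text
    end: int        # offset one past the last character


def spacing(a, b):
    """Number of characters separating two anchors in the same block."""
    if a.end <= b.start:
        return b.start - a.end
    if b.end <= a.start:
        return a.start - b.end
    return 0


def extract_candidates(anchors, existing_entries, max_distance=None):
    """Extract anchors sharing a block with an existing-entry anchor.

    With max_distance=None only the same-block condition applies;
    otherwise the spacing must also be below max_distance.
    """
    seeds = [a for a in anchors if a.text in existing_entries]
    extracted = []
    for cand in anchors:
        for seed in seeds:
            if cand is seed or cand.block_id != seed.block_id:
                continue
            if max_distance is None or spacing(cand, seed) < max_distance:
                extracted.append(cand)
                break
    return extracted
```

Note that an extracted anchor text may itself already be an existing entry; removing those is left to a later filtering step, as in the embodiment that follows.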
Embodiment 4
Fig. 6 is a schematic diagram of the entry acquisition device provided by this embodiment. As shown in Fig. 6, the device includes:
Existing entry acquisition module 601, configured to obtain the set of existing entries of a same category in the entry base.
Search module 602, configured to search using the set of existing entries obtained by the existing entry acquisition module 601, obtain the anchor texts containing the existing entries, and record the web page positions where the anchor texts of the existing entries are located.
Extraction module 603, configured to extract, at the corresponding positions according to the web page positions recorded by the search module 602, the anchor texts whose context distance from the anchor text of the existing entry meets the preset requirement.
Modules 601 to 603 are configured in the same way as the corresponding modules 501 to 503 in Embodiment 3 and are not described again here.
Existing word filtering module 604, configured to compare the extracted anchor texts with the entry base to obtain the anchor texts that are not yet included.
Because an extracted anchor text may already be an existing entry, the extracted anchor texts are filtered to remove existing words, so that, for efficiency, only the anchor texts not yet included are processed subsequently. For example, if "leading in hand" and "betrayal love song" in Fig. 3 are existing entries, they are filtered out.
Anchor texts extracted under one category may belong to another category; for example, persons such as "Wang Fei" and "Wang Lin" can be extracted in Fig. 3. The extracted anchor texts are therefore compared with the whole entry base, and the anchor texts already present in the entry base are removed to obtain the anchor texts not included. If an anchor text not included belongs to the person category or another preset related category, it is also retained for further processing by the subsequent part-of-speech filtering module 605 and new term identification module 606. A preset related category is a category that has an association with the current one and is set empirically; for example, the song category is associated with categories such as persons, films, and entertainment.
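The comparison against the whole entry base can be illustrated with a short sketch. It assumes, purely for illustration, that the entry base is held as a mapping from category name to a set of entries; the function name filter_existing is hypothetical.

```python
def filter_existing(extracted, entry_base):
    """Keep only extracted anchor texts absent from every category.

    entry_base maps a category name to the set of entries it contains
    (an assumed layout, not one specified by the patent). Repeated
    occurrences are kept so that frequencies can be counted later.
    """
    all_entries = set().union(*entry_base.values())
    return [text for text in extracted if text not in all_entries]
```

Checking against all categories rather than only the current one is what removes an extracted name that is already recorded under a different category.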
It should be noted that, when processing efficiency is not critical, this module may be omitted. Alternatively, this module may be applied after the new term identification module 606 has obtained the weight or frequency of the anchor texts, to identify whether they are not yet included and thereby determine the new terms. In that case, the part-of-speech filtering module 605 and the new term identification module 606 operate on the extracted anchor texts.
Part-of-speech filtering module 605, configured to filter out, from the anchor texts not included, those that do not contain a specified part of speech.
For the anchor texts obtained by the existing word filtering module 604, word segmentation and part-of-speech tagging are used to filter out the anchor texts that do not contain a specified part of speech, for example those containing no verb, noun, or adjective.
Meanwhile, to obtain well-formed entries, filtering may also be based on the length of the anchor text and the punctuation marks it contains, so that anchor texts that do not meet the requirements are filtered out.
Of course, this module is also optional.
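A rough sketch of the part-of-speech, length, and punctuation checks is given below. The segmenter and tagger are not specified in the text, so the sketch takes them as a caller-supplied function tag_words that returns (word, tag) pairs; that function, the tag names "n", "v", "a", and the length limits are all assumptions for illustration.

```python
import unicodedata


def has_punctuation(text):
    """True if any character falls in a Unicode punctuation category."""
    return any(unicodedata.category(ch).startswith("P") for ch in text)


def keep_anchor(text, tag_words, wanted_pos=frozenset({"n", "v", "a"}),
                min_len=2, max_len=20):
    """Decide whether an anchor text survives the filters.

    tag_words is a hypothetical callable: text -> list of (word, tag).
    wanted_pos, min_len and max_len are illustrative defaults only.
    """
    if not (min_len <= len(text) <= max_len):
        return False
    if has_punctuation(text):
        return False
    return any(pos in wanted_pos for _, pos in tag_words(text))
```

Running the cheap length and punctuation checks before the tagger avoids segmenting anchor texts that would be rejected anyway.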
New term identification module 606, configured to calculate the weight of the anchor texts not included according to their context distance from the anchor text of the existing entry, count the frequency with which the anchor texts not included occur in the current category, and identify the anchor texts whose frequency or weight meets a preset requirement as new terms.
The module counts the frequency, that is, the number of occurrences, in the current category of the anchor texts remaining after filtering by the part-of-speech filtering module 605, and calculates the weights of those anchor texts. Specifically, to calculate the weight of an anchor text according to the context distance from the anchor text of the existing entry, the module includes:
Distance determining unit, configured to determine, within a same web page block, the context distance between an anchor text not included and the anchor text of an existing entry.
Specifically, the distance determining unit first determines the anchor texts of existing entries contained in the web page block where the anchor text not included is located, and then calculates the distance between the anchor text not included and each of the anchor texts of existing entries so obtained.
The context distance d may be, but is not limited to being, calculated as the length of the character string between the anchor text not included and the existing entry, excluding symbols such as page layout tags, spaces, and carriage returns.
Finally, the distance determining unit selects the minimum of the distances as the context distance from the existing entry.
For example, a same web page block contains the anchor texts K1, K2, K3, ..., Kn of multiple existing entries and multiple anchor texts not included, L1, L2, L3, and so on. For each anchor text not included in the web page block, the distances to K1 through Kn are calculated, and the minimum of these distances is taken as the context distance between the anchor text not included and the existing entries.
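The minimum-distance rule above might be sketched as follows. The sketch assumes each anchor text is given as a (start, end) character span within the raw HTML of its block, and it strips tags with a simple regular expression; both choices, and all function names, are illustrative assumptions.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")   # page layout tags
WS_RE = re.compile(r"\s+")        # spaces, carriage returns, newlines


def gap_length(raw_between):
    """Length of the intervening string once tags and whitespace are removed."""
    return len(WS_RE.sub("", TAG_RE.sub("", raw_between)))


def pair_distance(block_html, span_a, span_b):
    """Distance between two anchor spans, in whichever order they appear."""
    first, second = sorted([span_a, span_b])
    return gap_length(block_html[first[1]:second[0]])


def context_distance(block_html, candidate_span, existing_spans):
    """Minimum distance from a candidate to any existing-entry anchor."""
    return min(pair_distance(block_html, candidate_span, k)
               for k in existing_spans)
```

For a block such as '<a>K1</a> - <a>L1</a>, xx <a>K2</a>', the distance from L1 to K1 counts only the "-" character, while the distance to K2 counts ",xx", so K1 determines the context distance.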
Weight calculation unit, configured to calculate, using the context distance determined by the distance determining unit, the weight of the anchor text not included in the corresponding web page block.
The weight calculation unit uses the context distance between the anchor text not included and the existing entry to calculate the weight of that anchor text in each web page block; the smaller the context distance, the larger the weight.
The weight may be, but is not limited to being, calculated using Formula 1.
For example, in the web page block of Fig. 3, the weight of the anchor text not included "can not hindering" is calculated using the anchor text "because love" of the existing entry, specifically:
The context distance is d = 6, the intervening character string being "2, Wang Lin, -", from which the weight is obtained.
Similarly, in each recorded web page block, the weight of the anchor text not included is calculated for the corresponding block.
Weighting unit, configured to sum, over the whole current category, the weights of the anchor text not included that were calculated in each extracted web page block, to obtain the weight of the anchor text not included.
Over the whole current category, the weights of the anchor text not included in each block, as calculated by the weight calculation unit, are summed to give the weight of the anchor text not included.
For example, the weights of "can not hindering" calculated by the weight calculation unit in each web page block are summed, giving a weight of 295.4 for "can not hindering", and it is judged whether this weight is greater than a preset weight threshold.
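The per-block weighting and the summation over the category can be sketched as below. Formula 1 is not reproduced in this text, so the sketch takes the weight function as a parameter instead of guessing it; any function that decreases as the distance grows fits the description. The input layout and the name total_weights are assumptions.

```python
from collections import defaultdict


def total_weights(block_distances, weight_fn):
    """Sum per-block weights of each anchor text over one category.

    block_distances: iterable of dicts, one per web page block, mapping
    an anchor text to its context distance d in that block.
    weight_fn: maps a distance d to a weight. It stands in for the
    patent's Formula 1, which is not available here.
    """
    totals = defaultdict(float)
    for distances in block_distances:
        for text, d in distances.items():
            totals[text] += weight_fn(d)
    return dict(totals)
```

As a usage illustration only, calling total_weights(blocks, lambda d: 1.0 / (1 + d)) uses a placeholder decreasing function; it is not claimed to be the formula behind the value 295.4 quoted above.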
The new term identification module 606 counts that "can not hindering" occurs 1442 times in the song category and judges whether this number is greater than a preset frequency threshold.
If the weight is greater than the preset weight threshold or the number of occurrences is greater than the preset frequency threshold, the anchor text is identified as a new term. Depending on the practical application, it may instead be required that both conditions be met simultaneously before an anchor text is identified as a new term.
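The decision rule can be written down in a few lines. The function name and the require_both flag are hypothetical, and the threshold values in the usage note are invented for illustration, since the text does not state them.

```python
def is_new_term(weight, frequency, weight_threshold, frequency_threshold,
                require_both=False):
    """Apply the weight and frequency thresholds to one anchor text.

    By default either condition suffices; with require_both=True the
    anchor text must exceed both thresholds.
    """
    by_weight = weight > weight_threshold
    by_frequency = frequency > frequency_threshold
    if require_both:
        return by_weight and by_frequency
    return by_weight or by_frequency
```

With the figures from the example (weight 295.4, 1442 occurrences) and assumed thresholds of 100.0 and 2000, the anchor text would be accepted under the "either" rule but rejected when both conditions are required.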
Judging module 607, configured to judge whether all categories in the entry base have been obtained. If so, control passes to the result output module 608, which outputs the recognition result for new terms; otherwise, control returns to the existing entry acquisition module 601 to obtain the set of existing entries of the next category in the entry base, until all categories have been processed and the result is output.
The entry acquisition method and device provided by the present invention mine entity entries using an existing dictionary and provide new terms that have not yet been created, which can guide users to create the knowledge corresponding to the new terms. This solves the problem of insufficient inclusion of entity entries in an encyclopedia database, helps complete the data of the framework (entity entry, attribute name, attribute value), and facilitates more effective knowledge search.
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

1. A method for acquiring entries, characterized by comprising:
S1. obtaining a set of existing entries of a same category in an entry base;
S2. searching using the obtained set of existing entries to obtain anchor texts containing the existing entries, and recording the web page positions where the anchor texts of the existing entries are located;
S3. according to the recorded web page positions, extracting at the corresponding positions the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.
2. The method according to claim 1, characterized in that, after step S3, the method further comprises:
S4. calculating the weight of the extracted anchor texts according to their context distance from the anchor text of the existing entry, counting the frequency with which the extracted anchor texts occur in the current category, and identifying the anchor texts whose frequency or weight meets a preset requirement as new terms.
3. The method according to claim 1 or 2, characterized in that the web page position where an anchor text is located comprises:
the web page where the anchor text is located, the web page block where the anchor text is located, and the position of the anchor text within the web page block.
4. The method according to claim 3, characterized in that the context distance meeting the preset requirement comprises:
the web page block where the extracted anchor text is located being the same as the web page block where the anchor text of the existing entry is located.
5. The method according to claim 4, characterized in that the context distance meeting the preset requirement further comprises:
the spacing distance between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
6. The method according to claim 2, characterized in that calculating the weight of the extracted anchor texts according to the context distance from the anchor text of the existing entry specifically comprises:
determining, within a same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
calculating, using the determined context distance, the weight of the extracted anchor text in the corresponding web page block;
summing, over the whole current category, the weights of the extracted anchor text calculated in each extracted web page block, to obtain the weight of the extracted anchor text.
7. The method according to claim 6, characterized in that determining, within a same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry specifically comprises:
determining the anchor texts of existing entries contained in the web page block where the extracted anchor text is located;
calculating the distance between the extracted anchor text and each of the anchor texts of existing entries so obtained;
selecting the minimum of the distances as the context distance from the existing entry.
8. The method according to claim 6, characterized in that, after step S3, the method further comprises:
comparing the extracted anchor texts with the entry base to obtain the anchor texts not included;
performing step S4 only on the anchor texts not included.
9. The method according to claim 2, characterized in that, after step S3, the method further comprises:
filtering out, from the extracted anchor texts, those that do not contain a specified part of speech;
performing step S4 only on the anchor texts remaining after the filtering.
10. A device for acquiring entries, characterized by comprising:
an existing entry acquisition module, configured to obtain a set of existing entries of a same category in an entry base;
a search module, configured to search using the set of existing entries obtained by the existing entry acquisition module, obtain anchor texts containing the existing entries, and record the web page positions where the anchor texts of the existing entries are located;
an extraction module, configured to extract, at the corresponding positions according to the web page positions recorded by the search module, the anchor texts whose context distance from the anchor text of the existing entry meets a preset requirement.
11. The device according to claim 10, characterized in that the device further comprises:
a new term identification module, configured to calculate the weight of the anchor texts extracted by the extraction module according to their context distance from the anchor text of the existing entry, count the frequency with which the extracted anchor texts occur in the current category, and identify the anchor texts whose frequency or weight meets a preset requirement as new terms.
12. The device according to claim 10 or 11, characterized in that the web page position where an anchor text is located comprises:
the web page where the anchor text is located, the web page block where the anchor text is located, and the position of the anchor text within the web page block.
13. The device according to claim 12, characterized in that the context distance meeting the preset requirement comprises:
the web page block where the extracted anchor text is located being the same as the web page block where the anchor text of the existing entry is located.
14. The device according to claim 13, characterized in that the context distance meeting the preset requirement further comprises:
the spacing distance between the extracted anchor text and the anchor text of the existing entry being less than a preset distance threshold.
15. The device according to claim 11, characterized in that the new term identification module comprises:
a distance determining unit, configured to determine, within a same web page block, the context distance between the extracted anchor text and the anchor text of the existing entry;
a weight calculation unit, configured to calculate, using the context distance determined by the distance determining unit, the weight of the extracted anchor text in the corresponding web page block;
a weighting unit, configured to sum, over the whole current category, the weights of the extracted anchor text calculated in each extracted web page block, to obtain the weight of the extracted anchor text.
16. The device according to claim 15, characterized in that the distance determining unit is specifically configured to:
determine the anchor texts of existing entries contained in the web page block where the extracted anchor text is located;
calculate the distance between the extracted anchor text and each of the anchor texts of existing entries so obtained;
select the minimum of the distances as the context distance from the existing entry.
17. The device according to claim 15, characterized in that the device further comprises:
an existing word filtering module, configured to compare the anchor texts extracted by the extraction module with the entry base to obtain the anchor texts not included,
and to supply the anchor texts not included to the new term identification module.
18. The device according to claim 11, characterized in that the device further comprises:
a part-of-speech filtering module, configured to filter out, from the anchor texts extracted by the extraction module, those that do not contain a specified part of speech,
and to supply the anchor texts remaining after the filtering to the new term identification module.
CN201210151282.6A 2012-05-15 2012-05-15 The acquisition methods and device of a kind of entry Active CN103425660B (en)


Publications (2)

Publication Number Publication Date
CN103425660A CN103425660A (en) 2013-12-04
CN103425660B true CN103425660B (en) 2017-10-17

Family

ID=49650418


Country Status (1)

Country Link
CN (1) CN103425660B (en)



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant