WO2014206186A1 - A method and apparatus for generating entry information - Google Patents

A method and apparatus for generating entry information

Info

Publication number
WO2014206186A1
WO2014206186A1 PCT/CN2014/079220 CN2014079220W WO2014206186A1 WO 2014206186 A1 WO2014206186 A1 WO 2014206186A1 CN 2014079220 W CN2014079220 W CN 2014079220W WO 2014206186 A1 WO2014206186 A1 WO 2014206186A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
classification
candidate
index
determining
Prior art date
Application number
PCT/CN2014/079220
Other languages
English (en)
French (fr)
Inventor
张伟
李海波
徐惠
卢佳
Original Assignee
百度在线网络技术(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司
Publication of WO2014206186A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating entry information.
  • a method for generating term information comprises the following steps:
  • the classification index corresponds to at least one category related webpage
  • d. generating the term information corresponding to the candidate word according to the at least one classification-related web page corresponding to the classification index.
  • a term generating apparatus for generating term information, wherein the term generating apparatus includes:
  • a first obtaining means configured to acquire a candidate word;
  • a second obtaining means configured to perform a search based on the candidate word to obtain feature information of the candidate word;
  • a first determining means configured to determine, according to the feature information of the candidate word, a classification index corresponding to the candidate word in the multi-level classification index information, wherein the classification index corresponds to at least one classification-related web page;
  • a first generating means configured to generate the term information corresponding to the candidate word according to the at least one classification-related web page corresponding to the classification index.
  • the invention has the advantage that content related to an entry can be mined from professional websites related to the entry and the entry information generated automatically, thereby improving the efficiency of generating entry information and yielding more comprehensive and complete entry information.
  • FIG. 1 is a flow chart of a method for generating term information in accordance with an aspect of the present invention
  • FIG. 2 is a flow chart of a method for generating entry information in accordance with a preferred embodiment of the present invention
  • FIG. 3 is a flow chart of a method for generating entry information in accordance with still another preferred embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for generating entry information according to still another preferred embodiment of the present invention.
  • Figure 5 is a block diagram showing the structure of a term generating apparatus for generating term information according to an aspect of the present invention
  • Figure 6 is a block diagram showing the structure of a term generating apparatus for generating term information according to a preferred embodiment of the present invention
  • Figure 7 is a block diagram showing the structure of a term generating apparatus for generating term information according to still another preferred embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a term generating apparatus for generating term information according to still another preferred embodiment of the present invention.
  • the same or similar reference numerals in the drawings denote the same or similar components.
  • the method according to the invention comprises a step S1, a step S2, a step S3 and a step S4.
  • the method according to the invention is implemented by a computer device.
  • the computer device includes an electronic device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance, the hardware of which includes but is not limited to a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
  • the computer device comprises a network device and/or a user device.
  • the user equipment includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet, a smart phone, a PDA, a game console, or an IPTV.
  • the network where the user equipment is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • the user equipment and the network are only examples; other existing or future user equipment and networks, where applicable to the present invention, are also included in the scope of the present invention and are incorporated herein by reference.
  • in step S1, the computer device acquires a candidate word.
  • the manner of obtaining candidate words includes, but is not limited to, any one of the following manners:
  • in step S2, the computer device performs a search based on the candidate word to obtain feature information of the candidate word.
  • the feature information includes one or more pieces of text information.
  • the text information includes but is not limited to any one of the following:
  • the feature information includes one or more pieces of text information and weight information of each piece of text information.
  • the manner in which the computer device performs a search based on the candidate words to obtain the feature information of the candidate words includes, but is not limited to, any one of the following:
  • the computer device searches in a thesaurus containing a plurality of candidate words and their corresponding feature information to obtain feature information corresponding to the candidate words obtained in step S1.
  • the computer device performs a search based on the candidate word through a first predetermined search engine to acquire one or more search result web pages corresponding to the candidate word; the computer device then determines, according to the one or more search result web pages, the feature information corresponding to the candidate word.
  • the first predetermined search engine includes, but is not limited to, a search engine that can perform a search based on candidate words and acquire one or more search result web pages.
  • the manner in which the computer device determines the feature information corresponding to the candidate word according to the one or more search result web pages includes, but is not limited to, one of the following: a) acquiring at least one keyword included in the one or more search result web pages; obtaining weight information of each of the at least one keyword; and determining the feature information corresponding to the candidate word based on each obtained keyword and its corresponding weight information.
  • the weight information is determined according to at least one of the following information: 1) an appearance frequency of the keyword in the one or more search result webpages;
  • the weight information is determined based on a value of a term frequency-inverse document frequency (TF-IDF) of each keyword in the one or more search result web pages.
  • TF-IDF term frequency-inverse document frequency
  • the computer device performs word segmentation on the web page content of the one or more search result web pages to obtain at least one keyword, counts the weight information of each of the at least one keyword, and then, based on each obtained keyword and its weight information, selects one or more keywords from the at least one keyword as the feature information corresponding to the candidate word.
  • the computer device selects one or more search result web pages from all of the search result web pages corresponding to the candidate words, and determines feature information corresponding to the candidate words based on the selected search result web pages.
  • the candidate words obtained by the computer device in step S1 include "Maldives", and the computer device searches for "Maldives" through a predetermined search engine, such as the Baidu search engine, and obtains a plurality of search result web pages.
  • the computer device selects the top ten search result web pages, web1 to web10, in the search results as the one or more search result web pages corresponding to the candidate word.
  • the computer device segments the web page content of the ten selected search result web pages to obtain a plurality of keywords, computes the TF-IDF value of each keyword relative to the ten search result web pages, and uses the obtained TF-IDF value as the weight information of each keyword; the computer device then sorts the keywords by TF-IDF value, selects the top 20 keywords, and uses these 20 keywords and their corresponding TF-IDF values as the feature information of the candidate word "Maldives".
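The Maldives example above can be sketched in Python as follows. This is a minimal illustration, not the implementation described by the applicant: the function name `top_keywords`, the pre-segmented token lists, and the smoothed IDF term `ln((1 + n) / (1 + df)) + 1` are assumptions, chosen so that a keyword appearing on every result page still receives a non-zero weight.

```python
import math
from collections import Counter


def top_keywords(pages, k=20):
    """Return the k highest-weighted (keyword, tf_idf) pairs.

    pages: list of token lists, one per search result page (assumed to be
    already word-segmented; segmentation itself is out of scope here).
    """
    n = len(pages)
    doc_freq = Counter()   # number of pages containing each keyword
    term_freq = Counter()  # occurrences of each keyword over all pages
    for tokens in pages:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    total = sum(term_freq.values())
    if total == 0:
        return []
    # Assumed weighting: relative term frequency times a smoothed IDF.
    weights = {
        term: (count / total) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical, already-segmented result pages for the candidate word.
pages = [
    ["maldives", "island", "beach"],
    ["maldives", "hotel"],
    ["maldives", "island"],
]
features = top_keywords(pages, k=2)
```

With `k=20` and ten real result pages, the returned list would play the role of the feature information (keywords plus weights) described in the example.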
  • the topic related information determines feature information corresponding to the candidate word.
  • the predetermined topic determination model is configured to perform operations such as data mining on a given text information by a predetermined model to obtain topic related information corresponding to the text information.
  • LDA Latent Dirichlet Allocation Model
  • PLSA Probabilistic Latent Semantic Analysis
  • Labeled LDA Labeled Latent Dirichlet Allocation Model
  • the topic related information includes information for characterizing one or more topics of the text information, for example, a plurality of keywords characterizing a topic of the text information.
  • the topic related information further includes information for characterizing the weights of the one or more topics in the text, for example, the keyword weights corresponding to the plurality of keywords used to represent the topic of the text information.
  • in step S3, the computer device determines a classification index corresponding to the candidate word in the multi-level classification index information according to the feature information of the candidate word.
  • the multi-level classification index information includes a plurality of classification indexes that are related to each other based on a predetermined topology structure, wherein each of the classification indexes respectively corresponds to at least one classification related webpage.
  • the computer device acquires a similarity between the feature information of the candidate word and the at least one classification-related web page of each classification index in the multi-level classification index information, and determines the classification index corresponding to the candidate word based on the similarity.
  • in step S4, the computer device determines the term information corresponding to the candidate word according to the at least one classification-related web page corresponding to the classification index.
  • the computer device acquires, from the at least one classification-related web page corresponding to the classification index, web page content related to the candidate word, to generate entry information that corresponds to the candidate word and belongs to the classification index.
  • the manner in which the computer device obtains the content information related to the candidate word from the at least one classified related webpage includes:
  • the computer device mines, according to the candidate word and its feature information, the web page content corresponding to the candidate word and its feature information from the at least one classification-related web page as the content information of the term information corresponding to the candidate word.
  • the multi-level classification index information includes a classification index associated with a predetermined tree topology as shown in Table 1 below: Table 1
  • each of the classification indexes corresponds to a plurality of classification related web pages
  • the computer device determines in step S3 that the classification index corresponding to the candidate word "Maldives” is "domestic tour", and the computer device corresponds to the "domestic tour” corresponding to the classification index.
  • the computer device acquires, from the at least one classification-related web page corresponding to the classification index, content information related to the candidate word, and uses it to update the term information corresponding to the candidate word.
  • the content of the term information can be automatically obtained from the classification having higher similarity with the candidate words, thereby greatly improving the effect of generating and updating the term information. Moreover, the content of the classified related web page can be more fully explored and utilized.
  • Figure 2 illustrates a flow chart of a method for generating term information in accordance with a preferred embodiment of the present invention.
  • the method according to the present embodiment includes steps S1 to S4, step S5, step S6, and step S7.
  • in step S5, the computer device acquires one or more pieces of network posting information corresponding to the candidate word.
  • the network publishing information includes a certain type of information for being published on the Internet.
  • the network posting information includes an advertisement.
  • the ways in which the computer device acquires one or more pieces of network posting information corresponding to the candidate word include, but are not limited to, one of the following:
  • the computer device obtains one or more network posting information corresponding to the candidate word by querying the candidate word in a second predetermined search engine.
  • the second predetermined search engine includes, but is not limited to, a search engine that can perform a search based on candidate words and acquire one or more network publishing information.
  • the second predetermined search engine is the same search engine as the first predetermined search engine described above with reference to the embodiment of FIG.
  • the computer device acquires one or more network posting information corresponding to the candidate word by a predetermined correspondence between the predetermined candidate words and the network publishing information.
  • in step S6, the computer device determines the importance information of the candidate word based on the obtained one or more pieces of network posting information.
  • the manner in which the computer device determines the importance information of the candidate word according to the obtained one or more network publishing information includes, but is not limited to, any one of the following:
  • the computer device counts weight information of the candidate word relative to the one or more network posting information.
  • the computer device counts the TF-IDF value of the candidate word relative to the plurality of advertisements corresponding thereto as the importance information of the candidate word.
  • the computer device counts the quantity of the one or more network publishing information, and uses it as the importance information of the candidate word;
  • the computer device acquires the used information of the one or more network posting information, and determines the importance information of the candidate word according to the obtained used information.
  • the used information of the network publishing information includes but is not limited to at least one of the following:
  • for example, the computer device counts the number of clicks on all advertisements corresponding to the candidate word and uses it as the importance information of the candidate word; as another example, the computer device computes the average number of clicks on the advertisements corresponding to the candidate word and uses it as the importance information of the candidate word.
  • in step S7, the computer device determines whether the importance information of the candidate word satisfies a predetermined importance condition.
  • the predetermined importance condition includes a predetermined importance threshold
  • the computer device determines whether the importance information of the candidate word satisfies a predetermined threshold.
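Steps S6 and S7 can be sketched as below for the click-count variants mentioned above. The function names, the `mode` parameter, and the reading of "satisfies a predetermined threshold" as "greater than or equal to" are assumptions made for illustration only.

```python
from statistics import mean


def importance_from_clicks(click_counts, mode="total"):
    """Derive importance information from per-advertisement click counts.

    mode="total" sums the clicks of all advertisements for the candidate
    word; mode="average" uses the mean number of clicks per advertisement.
    """
    if not click_counts:
        return 0.0
    if mode == "total":
        return float(sum(click_counts))
    if mode == "average":
        return float(mean(click_counts))
    raise ValueError(f"unknown mode: {mode}")


def satisfies_importance_condition(importance, threshold):
    """Assumed condition: importance must reach the predetermined threshold."""
    return importance >= threshold


# Hypothetical click counts for three advertisements of one candidate word.
clicks = [10, 20, 30]
total_importance = importance_from_clicks(clicks)
average_importance = importance_from_clicks(clicks, mode="average")
```

Only candidate words for which `satisfies_importance_condition` returns `True` would proceed to step S2.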
  • in step S2, when the importance information of the candidate word satisfies the predetermined importance condition, the computer device acquires the feature information of the candidate word.
  • the step S5 further includes a step S501 (not shown) and a step S502 (not shown), and the step S6 further includes a step S601 (not shown) and a step S602 (not shown).
  • in step S501, the computer device segments the candidate word to obtain a plurality of sub-candidate words.
  • in step S502, the computer device performs a search based on each of the sub-candidate words through the second predetermined search engine to acquire the network posting information corresponding to each of the sub-candidate words.
  • the word is the same or similar to the one or more network publishing information corresponding to the candidate word, and therefore will not be described again.
  • in step S601, the computer device determines the sub-importance information of each sub-candidate word based on the network posting information corresponding to that sub-candidate word.
  • the information is in the same or similar way, so it will not be described again.
  • in step S602, the computer device determines the importance information of the candidate word based on the sub-importance information of each of the sub-candidate words.
  • the computer device determines sub-importance information of each sub-candidate word based on predetermined statistical rules.
  • the ways in which the computer device determines the importance information from the sub-importance information of each sub-candidate word based on predetermined statistical rules include, but are not limited to, any of the following:
  • the computer device determines the average importance information according to the sub-importance information of each sub-candidate word, and uses it as the importance information of the candidate word.
  • the computer device acquires the weight values of the respective sub-candidates with respect to the candidate words to which they belong, and determines the importance information of the candidate words based on the sub-importance information of each of the sub-candidate words and the weight values of the respective sub-candidate words.
  • for example, the computer device determines a weight value of each sub-candidate word based on the number of occurrences of that sub-candidate word in the candidate word to which it belongs, and determines the importance information of the candidate word based on the sub-importance information and the weight value of each sub-candidate word.
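The two statistical rules described for step S602 (a plain average and an occurrence-weighted combination) might look like the following sketch. The function names and the normalisation by the total weight are assumptions; the source does not specify how the weighted values are combined.

```python
def occurrence_weights(candidate, sub_words):
    """Weight each sub-candidate word by how often it occurs in the candidate."""
    return {sub: candidate.count(sub) for sub in sub_words}


def aggregate_importance(sub_importance, weights=None):
    """Combine sub-importance values into one importance value.

    Without weights this is the plain average; with weights it is a
    weighted average (normalising by the total weight is an assumption).
    """
    if not sub_importance:
        raise ValueError("at least one sub-candidate word is required")
    if weights is None:
        return sum(sub_importance.values()) / len(sub_importance)
    total_weight = sum(weights[sub] for sub in sub_importance)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    weighted = sum(sub_importance[sub] * weights[sub] for sub in sub_importance)
    return weighted / total_weight


# Hypothetical sub-importance values for two sub-candidate words.
sub_importance = {"maldives": 0.9, "travel": 0.3}
weights = occurrence_weights("maldives travel travel", sub_importance)
plain = aggregate_importance(sub_importance)
weighted = aggregate_importance(sub_importance, weights)
```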
  • the term is generated only for the candidate words satisfying the predetermined importance condition, and the term generation efficiency is improved.
  • Figure 3 illustrates a flow chart of a method for generating term information in accordance with yet another preferred embodiment of the present invention.
  • the method according to the present embodiment includes steps S1 to S4, step S8, and step S9.
  • in step S8, the computer device acquires web page navigation information of one or more websites.
  • the one or more websites may be one or more manually designated websites having certain similarities, or may be one or more websites determined to have a certain similarity by performing cluster analysis on the web page contents of a large number of websites.
  • the webpage navigation information includes, but is not limited to, information that provides a prompt for the user to browse the webpage based on the webpage column structure in the website.
  • in step S9, the computer device generates multi-level classification index information according to the obtained web page navigation information, wherein the classification indexes in the multi-level classification index information are associated with each other according to a predetermined topology.
  • the manner in which the computer device generates the multi-level classification index information according to the obtained one or more webpage navigation information includes, but is not limited to, any one of the following:
  • one or more columns commonly included in the navigation bars of the plurality of websites are used as classification indexes, and the belonging relationship between the respective columns in one of the website navigation bars is selected as a reference for the relationship between the obtained classification indexes, so as to generate the multi-level classification index.
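The common-column approach can be sketched as follows. Representing each navigation bar as a mapping from column name to parent column (with `None` for top-level columns), the function name `build_classification_index`, and the rule of re-attaching a column to its nearest common ancestor are assumptions introduced for this example.

```python
def build_classification_index(site_navs, reference=0):
    """Build a {classification index: parent index} mapping.

    site_navs: list of navigation bars, each a dict mapping a column name
    to its parent column name (None for a top-level column).
    reference: position of the site whose column hierarchy is reused.
    """
    if not site_navs:
        return {}
    # Columns shared by every navigation bar become classification indexes.
    common = set(site_navs[0])
    for nav in site_navs[1:]:
        common &= set(nav)
    ref = site_navs[reference]
    index = {}
    for column in common:
        parent = ref[column]
        # Skip ancestors that are not shared by all sites (assumed rule).
        while parent is not None and parent not in common:
            parent = ref.get(parent)
        index[column] = parent
    return index


# Hypothetical navigation bars of two travel websites.
site_a = {
    "travel": None,
    "domestic tour": "travel",
    "outbound tour": "travel",
    "visa": "outbound tour",
}
site_b = {
    "travel": None,
    "domestic tour": "travel",
    "outbound tour": "travel",
    "hotels": None,
}
classification_index = build_classification_index([site_a, site_b])
```

The resulting parent mapping is one possible encoding of a tree topology in which adjacent levels are in a belonging relationship.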
  • the method according to the present solution further includes a step S10 (not shown), a step S11 (not shown), and the step S3 further includes a step S301 (not shown).
  • in step S10, the computer device acquires the classification-related web pages corresponding to each classification index in the multi-level classification index information according to the web page navigation information of the one or more websites corresponding to the multi-level classification index information.
  • the computer device determines, according to the web page navigation information of the one or more websites corresponding to the multi-level classification index information, the part of that navigation information corresponding to each classification index, and acquires at least one site web page corresponding to that partial navigation information in the one or more websites as the classification-related web page corresponding to the classification index.
  • in step S11, the computer device determines the classification feature information corresponding to each classification index based on the classification-related web pages corresponding to that classification index.
  • the one or more search result web pages are used to determine the feature information corresponding to the candidate words in the same or similar manner, and details are not described herein again.
  • in step S301, the computer device determines, according to the feature information of the candidate word and the classification feature information of each classification index, a classification index corresponding to the candidate word.
  • the computer device compares the feature information of the candidate word with the classification feature information of each classification index, and selects a classification index for which the similarity between its classification feature information and the feature information of the candidate word satisfies a predetermined similarity condition, as the classification index corresponding to the candidate word.
  • the predetermined similarity condition includes that the similarity satisfies a predetermined similarity threshold.
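The comparison against a predetermined similarity threshold can be illustrated with cosine similarity over keyword-weight vectors. The source does not name a similarity measure, so cosine similarity, the function names, and the sample weights below are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {keyword: weight} mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_index(candidate_features, index_features, threshold=0.5):
    """Return the most similar classification index, or None if no index
    reaches the (assumed) similarity threshold."""
    scored = [
        (cosine(candidate_features, features), name)
        for name, features in index_features.items()
    ]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= threshold else None


# Hypothetical feature information and classification feature information.
candidate_features = {"island": 1.0, "beach": 0.5}
index_features = {
    "outbound tour": {"island": 0.8, "visa": 0.2},
    "hotel": {"room": 1.0},
}
matched = match_index(candidate_features, index_features)
```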
  • the predetermined topology includes a multi-level topology in which the classification indexes of two adjacent levels are in an affiliation relationship, and the step S3 further includes a step S302 (not shown) and a step S303 (not shown).
  • the predetermined topology structure comprises a multi-level tree structure, and the adjacent two levels of classification indexes are affiliation relationships.
  • in step S302, the computer device compares the feature information of the candidate word with the classification feature information of the respective classification indexes to obtain a classification index whose classification feature information is similar to the feature information of the candidate word.
  • the computer device compares the feature information of the candidate word with the classification feature information of the respective classification indexes one by one, according to a predetermined traversal order over the predetermined topology, to obtain a classification index whose classification feature information is similar to the feature information of the candidate word.
  • a classification index that has not yet been traversed is acquired at random, and the classification feature information of that classification index is compared with the feature information of the candidate word, to obtain a classification index whose classification feature information is similar to the feature information of the candidate word.
  • the classification indexes that are leaf nodes are obtained first, and the classification feature information of the classification indexes of that layer is compared with the feature information of the candidate word; when no classification index similar to the feature information of the candidate word is found among the leaf nodes, the classification indexes of the nodes in the layer above the leaf nodes are acquired and their classification feature information is compared with the feature information of the candidate word, and so on layer by layer until a classification index similar to the feature information of the candidate word is obtained.
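The leaf-first, layer-by-layer traversal can be sketched as below. Several simplifications are assumed: the tree is given as a child-to-parent mapping, "layer" is taken to mean depth in the tree (so the deepest layer is visited first), similarity is a plain Jaccard overlap of keyword sets, and the threshold value is arbitrary.

```python
def depth(name, parent_of):
    """Number of ancestors of a classification index in the tree."""
    level = 0
    while parent_of[name] is not None:
        name = parent_of[name]
        level += 1
    return level


def overlap(a, b):
    """Jaccard overlap of two keyword collections (assumed similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_index_bottom_up(candidate_keywords, parent_of, features, threshold=0.3):
    """Search the deepest layer first and move up one layer at a time."""
    layers = {}
    for name in parent_of:
        layers.setdefault(depth(name, parent_of), []).append(name)
    for level in sorted(layers, reverse=True):
        scored = [
            (overlap(candidate_keywords, features[name]), name)
            for name in layers[level]
        ]
        score, name = max(scored)
        if score >= threshold:
            return name
    return None


# Hypothetical two-level tree and classification feature keywords.
parent_of = {"travel": None, "domestic tour": "travel", "outbound tour": "travel"}
features = {
    "travel": {"travel", "tour", "hotel"},
    "domestic tour": {"beijing", "great wall"},
    "outbound tour": {"island", "visa", "beach"},
}
leaf_match = find_index_bottom_up({"island", "beach", "resort"}, parent_of, features)
upper_match = find_index_bottom_up({"tour", "hotel"}, parent_of, features)
```

When the match is found above the lowest layer, as with `upper_match` here, the returned index is not an underlying classification index, which is the situation that step S304 addresses.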
  • the computer device uses the underlying classification index as a classification index corresponding to the candidate word.
  • the computer device determines whether the obtained classification index is an underlying classification index, and when the obtained classification index includes an underlying classification index, the computer device uses the bottom layer classification index as a classification index corresponding to the candidate word.
  • the step S3 further includes a step S304 (not shown) and a step S305 (not shown).
  • in step S304, when the obtained classification index does not include an underlying classification index, the computer device generates a subordinate classification index of the lowest-level classification index obtained, based on the one or more classification-related web pages corresponding to that lowest-level classification index and on the candidate word.
  • the manner in which the computer device generates the subordinate classification index of the lowest-level classification index based on the one or more classification-related web pages corresponding to the lowest-level classification index and on the candidate word includes, but is not limited to, any of the following:
  • the computer device generates, according to the candidate word, the name of a subordinate classification index belonging to the classification index obtained in the foregoing step S302, and determines, according to the search result pages corresponding to the candidate word and the classification-related pages corresponding to the obtained classification index, the classification-related web pages corresponding to the subordinate classification index.
  • the computer device queries the one or more web pages corresponding to the classification index obtained in the foregoing step S302 to acquire at least one web page related to the candidate word, determines the central word corresponding to the at least one web page, takes the central word as the name of the subordinate classification index of the classification index obtained in step S302, and uses the at least one web page as the classification-related web page corresponding to the subordinate classification index.
  • in step S305, the computer device uses the generated underlying classification index as the classification index corresponding to the candidate word.
  • the multi-level classification index is established by acquiring website navigation information of one or more websites, so that the classification index system of the terms is similar to the systems in actual use, which is beneficial to more comprehensive mining of the content information of professional websites; and because the web page content of these websites can also be used as classification-related web pages, more systematic and complete entry information can be generated for candidate words.
  • FIG. 4 is a flow chart of a method for generating term information in accordance with yet another preferred embodiment of the present invention.
  • the method according to the present embodiment includes steps S1 to S4, step S12, step S13, step S14, and step S15.
  • in step S12, the computer device acquires one or more web pages of the candidate website.
  • the manner in which the computer device determines the candidate website includes but is not limited to any of the following:
  • in step S13, the computer device determines site feature information of the candidate website according to one or more web pages of the candidate website.
  • the result web page is the same or similar in the manner of determining the feature information corresponding to the candidate word, and details are not described herein again.
  • in step S14, the computer device compares the site feature information of the candidate website with the classification feature information of each classification index to determine one or more classification indexes corresponding to the candidate website.
  • the manner is the same as or similar to that of step S302, in which the computer device compares the feature information of the candidate word with the classification feature information of the respective classification indexes to obtain a classification index whose classification feature information is similar to the feature information of the candidate word, and is not repeated here.
  • in step S15, the computer device provides the candidate user corresponding to the candidate website with one or more candidate words respectively corresponding to the one or more classification indexes.
  • the method according to the embodiment further includes step S16 (not shown), step S17 (not shown), and step S18 (not shown).
  • in step S16, the computer device obtains one or more candidate web pages corresponding to the one or more classification indexes in the candidate website according to the one or more classification indexes corresponding to the candidate website.
  • the manner in which the computer device obtains one or more candidate web pages corresponding to the one or more classification indexes in the candidate website according to the one or more classification indexes corresponding to the candidate website includes, but is not limited to, any of the following:
  • the computer device acquires the classification-related web pages of the one or more classification indexes and compares them with the site web pages of the candidate website, to obtain one or more site web pages similar to the classification-related web pages as candidate web pages for the classification index to which those classification-related web pages correspond.
  • the computer device obtains, from the candidate website, one or more candidate web pages respectively similar to the classification feature information of the one or more classification indexes according to the classification feature information of the one or more classification indexes.
  • In step S17, the computer device determines or updates the classification related web pages corresponding to the respective classification indexes based on the one or more candidate web pages of the candidate website corresponding to those classification indexes.
  • the computer device adds the determined candidate web page as a category-related web page corresponding to the category index to the category-related web page library corresponding to each category index.
  • In step S18, the computer device updates the term information of the candidate words corresponding to the respective classification indexes based on the updated classification related web pages corresponding to those classification indexes.
  • the updated classification related webpages of the classification index are used to update the term content of each candidate word.
  • The manner of updating the term content of each candidate word using the classification related web pages of a classification index is the same as or similar to the manner in which, in step S4 of the foregoing embodiment, the computer device determines the term information corresponding to the candidate word based on the classification related web pages, and is not repeated here.
  • The term information is automatically updated using the content of the candidate website, so that the term content can be updated promptly and the update efficiency is improved.
  • Figure 5 is a block diagram showing the structure of a term generating device for generating term information in accordance with an aspect of the present invention.
  • the term generating device according to the present invention comprises a first obtaining means 1, a second obtaining means 2, a first determining means 3 and a first generating means 4.
  • the first obtaining means 1 acquires candidate words.
  • the manner of obtaining candidate words includes, but is not limited to, any one of the following manners:
  • the second obtaining means 2 performs a search based on the candidate words to acquire feature information of the candidate words.
  • the feature information includes one or more pieces of text information.
  • the text information includes but is not limited to any one of the following:
  • the feature information includes one or more pieces of text information and weight information of each piece of text information.
  • the manner in which the second acquiring device 2 performs a search based on the candidate words to obtain the feature information of the candidate words includes, but is not limited to, any one of the following:
  • The second obtaining means 2 searches a vocabulary containing a plurality of candidate words and their corresponding feature information to obtain the feature information corresponding to the candidate word obtained by the first obtaining means 1.
  • The first search device (not shown) of the second obtaining device 2 performs a search based on the candidate word through a first predetermined search engine to acquire one or more search result web pages corresponding to the candidate word; then, the second determining device (not shown) of the second obtaining device 2 determines the feature information corresponding to the candidate word according to the one or more search result web pages.
  • the first predetermined search engine includes, but is not limited to, a search engine that can perform a search based on candidate words and acquire one or more search result web pages.
  • The manner of determining the feature information corresponding to the candidate word includes, but is not limited to, any one of the following:
  • a) A keyword obtaining device (not shown) in the second determining device acquires at least one keyword included in the one or more search result web pages; then, a weight obtaining device (not shown) in the second determining device acquires weight information of each keyword in the at least one keyword; then, the first sub-determining device in the second determining device determines the feature information corresponding to the candidate word based on the obtained keywords and their corresponding weight information.
  • the weight information is determined according to at least one of the following information: 1) an appearance frequency of the keyword in the one or more search result webpages;
  • the weight information is determined based on a value of a term frequency-inverse document frequency (TF-IDF) of each keyword in the one or more search result web pages.
  • The keyword obtaining means performs word segmentation on the web page content of the one or more search result web pages to obtain at least one keyword; the weight obtaining means then counts and determines the weight information of each keyword in the at least one keyword; next, the first sub-determining means selects one or more keywords from the at least one keyword as the feature information corresponding to the candidate word, based on the obtained keywords and their weight information.
  • the computer device selects one or more search result web pages from all of the search result web pages corresponding to the candidate words, and determines feature information corresponding to the candidate words based on the selected search result web pages.
  • For example, the candidate words obtained by the first obtaining means 1 include "Maldives"; the first search means searches for "Maldives" through a predetermined search engine, such as the Baidu search engine, to obtain a plurality of search result web pages, and selects the top ten search result web pages, web1 to web10, as the one or more search result web pages corresponding to the candidate word.
  • The keyword obtaining device performs word segmentation on the web page content of the ten selected search result web pages to obtain a plurality of keywords; the weight obtaining device computes the TF-IDF value of each keyword relative to the ten search result web pages and uses that value as the weight information of the keyword; then, the first sub-determining device sorts the keywords by TF-IDF value, selects the top 20 keywords, and uses these twenty keywords and their corresponding TF-IDF values as the feature information of the candidate word "Maldives".
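  • The keyword selection in the "Maldives" example can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the function name `top_keywords`, the pre-tokenized input, and the smoothed IDF formula are all assumptions, and the IDF is computed only over the selected result pages.

```python
import math
from collections import Counter


def top_keywords(pages, top_n=20):
    """Rank keywords over a set of tokenized search result pages.

    `pages` is a list of token lists; word segmentation is assumed to have
    been done already. The weight of a keyword is its relative frequency
    across all pages multiplied by a smoothed inverse document frequency
    (an assumed formula, chosen so that no weight is zero).
    """
    n_pages = len(pages)
    term_freq = Counter()  # total occurrences of each keyword
    doc_freq = Counter()   # number of pages containing each keyword
    for tokens in pages:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    total_terms = sum(term_freq.values())
    weights = {}
    for word, count in term_freq.items():
        tf = count / total_terms
        idf = math.log((1 + n_pages) / (1 + doc_freq[word])) + 1.0
        weights[word] = tf * idf

    # Highest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical tokenized pages standing in for web1 to web10.
sample_pages = [
    ["island", "beach", "hotel"],
    ["island", "beach"],
    ["island", "visa"],
]
feature_info = top_keywords(sample_pages, top_n=2)
```

  • The returned list of (keyword, weight) pairs plays the role of the feature information; with `top_n=20` it corresponds to the twenty keywords of the example.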
  • A model determining device in the second determining device determines, through a predetermined topic determination model and according to the web page content of each of the one or more search result web pages, the topic related information corresponding to the one or more search result web pages; next, the second sub-determining means (not shown) in the second determining means determines the feature information corresponding to the candidate word based on the determined topic related information.
  • the predetermined topic determination model is configured to perform operations such as data mining on a given text information by a predetermined model to obtain topic related information corresponding to the text information.
  • LDA (Latent Dirichlet Allocation) model
  • PLSA (Probabilistic Latent Semantic Analysis)
  • Labeled LDA (Labeled Latent Dirichlet Allocation) model
  • the subject related information includes information for characterizing one or more topics of the text information, for example, a plurality of key words for characterizing a topic of the text information, and the like.
  • The subject related information further includes information for characterizing the weights of the one or more topics in the text, for example, the keyword weights corresponding to the plurality of keywords used to characterize the topic of the text information.
  • Those skilled in the art should be able to determine, according to the actual situation and needs, the topic model used and the manner of obtaining one or more pieces of topic related information through the topic model.
  • the first determining means 3 determines a classification index corresponding to the candidate word in the multi-level classification index information according to the feature information of the candidate word.
  • the multi-level classification index information includes a plurality of classification indexes that are related to each other based on a predetermined topology structure, wherein each of the classification indexes respectively corresponds to at least one classification related webpage.
  • The first determining device 3 acquires the similarity between the feature information of the candidate word and the at least one classification related web page of each classification index in the multi-level classification index information, and determines the classification index corresponding to the candidate word based on the similarity.
  • the first generating means 4 determines the term information corresponding to the candidate word based on the at least one category related web page corresponding to the category index information.
  • The first generating device 4 acquires, from the at least one classification related web page corresponding to the classification index, web page content related to the candidate word, to generate term information that corresponds to the candidate word and belongs to the classification index.
  • the manner in which the first generating device 4 obtains the content information related to the candidate words from the at least one classified related webpage includes:
  • According to the candidate word and its feature information, the first generating device 4 mines, from the at least one classification related web page, web page content corresponding to the candidate word and its feature information as the content information of the term information corresponding to the candidate word.
  • the multi-level classification index information includes a classification index associated with a predetermined tree topology as shown in Table 2 below:
  • each of the classification indexes corresponds to a plurality of classification related web pages
  • the first determining means 3 determines that the classification index corresponding to the candidate word "Maldives” is "domestic tour", and the first generation device 4 selects "domestic tour” from the classification index.
  • The computer device acquires, from the at least one classification related web page corresponding to the classification index, content information related to the candidate word, and uses it to update the term information corresponding to the candidate word.
  • the content of the term information can be automatically obtained from the classification having higher similarity with the candidate words, thereby greatly improving the effect of generating and updating the term information. Moreover, the content of the classified related web page can be more fully explored and utilized.
  • Figure 6 is a block diagram showing the structure of a term generating apparatus for generating term information in accordance with a preferred embodiment of the present invention.
  • the term generating means includes a first obtaining means 1, a second obtaining means 2, a first determining means 3, a first generating means 4, a third obtaining means 5, a third determining means 6, and a judging means 7.
  • The first obtaining device 1, the second obtaining device 2, the first determining device 3, and the first generating device 4 have been described in detail in the embodiment shown in FIG. 5; that description is incorporated herein by reference and is not repeated.
  • The third obtaining device 5 acquires one or more pieces of network publishing information corresponding to the candidate word.
  • the network publishing information includes a certain type of information for being published on the Internet.
  • the network posting information includes an advertisement.
  • the manner in which the third acquiring device 5 acquires one or more network publishing information corresponding to the candidate word includes, but is not limited to, any one of the following:
  • the third obtaining means 5 acquires one or more network posting information corresponding to the candidate words by querying the candidate words in a second predetermined search engine.
  • the second predetermined search engine includes, but is not limited to, a search engine that can perform a search based on candidate words and acquire one or more network publishing information.
  • the second predetermined search engine is the same search engine as the first predetermined search engine described above with reference to the embodiment of FIG.
  • The third obtaining means 5 obtains one or more pieces of network publishing information corresponding to the candidate word through a predetermined correspondence between candidate words and network publishing information.
  • the third determining means 6 determines the importance information of the candidate words based on the obtained one or more network posting information.
  • the manner in which the third determining device 6 determines the importance information of the candidate word according to the obtained one or more network publishing information includes, but is not limited to, any one of the following:
  • the third determining means 6 counts the weight information of the candidate words with respect to the one or more network distribution information.
  • the third determining means 6 counts the TF-IDF value of the candidate word with respect to the plurality of advertisements corresponding thereto as the importance information of the candidate word.
  • the third determining device 6 counts the quantity of the one or more network publishing information, and uses it as the importance information of the candidate word;
  • the third determining means 6 acquires the used information of the one or more network posting information, and determines the importance degree information of the candidate word based on the obtained used information.
  • the used information of the network publishing information includes, but is not limited to, at least one of the following: a) the number of times the network publishes information;
  • For example, the third determining means 6 counts the number of clicks on all the advertisements corresponding to the candidate word and uses it as the importance information of the candidate word; as another example, the third determining means 6 computes the average number of times the advertisements corresponding to the candidate word are clicked and uses it as the importance information of the candidate word.
  • The judging means 7 judges whether the importance information of the candidate word satisfies a predetermined importance condition.
  • The predetermined importance condition includes a predetermined importance threshold, and the judging means 7 judges whether the importance information of the candidate word satisfies the predetermined threshold.
  • When the importance information of the candidate word satisfies the predetermined importance condition, the second obtaining means 2 acquires the feature information of the candidate word.
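  • The importance check can be sketched as follows, assuming that the used information is a list of per-advertisement click counts and that "satisfies the threshold" means "is greater than or equal to it". Both function names and the comparison direction are assumptions made for illustration only.

```python
def importance_from_clicks(click_counts, use_average=False):
    """Importance of a candidate word from per-advertisement click counts.

    Returns the total number of clicks, or the average number of clicks
    per advertisement when `use_average` is True. A candidate word with
    no advertisements gets an importance of 0.0.
    """
    if not click_counts:
        return 0.0
    total = float(sum(click_counts))
    return total / len(click_counts) if use_average else total


def passes_importance(click_counts, threshold, use_average=False):
    """True when the importance reaches the (assumed inclusive) threshold."""
    return importance_from_clicks(click_counts, use_average) >= threshold
```

  • Only candidate words for which `passes_importance` returns True would go on to feature acquisition and term generation.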
  • The third obtaining device 5 further includes a first sub-acquisition device (not shown) and a second search device (not shown), and the third determining device 6 further includes a third sub-determining device (not shown) and a fourth sub-determining device (not shown).
  • The first sub-acquisition device segments the candidate word to obtain a plurality of sub-candidate words.
  • the second search means performs a search based on each of the sub-candidate words by the second predetermined search engine to acquire the network posting information corresponding to each of the sub-candidate words.
  • The manner in which the second search device performs a search based on each sub-candidate word through the second predetermined search engine to obtain the network publishing information corresponding to each sub-candidate word is the same as or similar to the manner in which the third obtaining device 5 queries the candidate word in the second predetermined search engine to obtain one or more pieces of network publishing information corresponding to the candidate word, and is therefore not described again.
  • the third sub-determination means determines sub-importance information of the sub-candidate words based on the network distribution information corresponding to each sub-candidate word.
  • The manner in which the third sub-determining device determines the sub-importance information of each sub-candidate word based on the network publishing information corresponding to that sub-candidate word is the same as or similar to the manner in which the third determining device 6 determines the importance information of the candidate word according to the obtained one or more pieces of network publishing information, and is therefore not repeated.
  • the fourth sub-determining means determines the importance degree information of the candidate words based on the sub-importance information of the respective sub-candidate words.
  • The fourth sub-determining means determines the importance information of the candidate word from the sub-importance information of each sub-candidate word based on a predetermined statistical rule.
  • The manner in which the fourth sub-determining means does so includes, but is not limited to, any of the following:
  • The fourth sub-determining means determines the average importance information based on the sub-importance information of each sub-candidate word, and uses it as the importance information of the candidate word.
  • the fourth sub-determination device acquires the weight values of the respective sub-candidate words relative to the candidate words to which they belong, and determines the importance degree of the candidate words based on the sub-importance information of each sub-candidate word and the weight value of each sub-candidate word. information.
  • The fourth sub-determining means determines the weight value of each sub-candidate word based on the number of occurrences of that sub-candidate word in the candidate word to which it belongs, and determines the importance information of the candidate word based on the sub-importance information and the weight value of each sub-candidate word.
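  • The occurrence-weighted combination can be sketched as a weighted average. The text does not fix the exact statistical rule, so the normalization by the total number of occurrences and the function name `combine_sub_importance` are assumptions.

```python
from collections import Counter


def combine_sub_importance(sub_words, sub_importance):
    """Combine sub-importance values into one importance value.

    `sub_words` lists the sub-candidate words in order of occurrence in
    the candidate word (repeats allowed); `sub_importance` maps each
    sub-candidate word to its sub-importance. Each sub-candidate word is
    weighted by how often it occurs; unknown words contribute 0.0.
    """
    counts = Counter(sub_words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weighted = sum(sub_importance.get(word, 0.0) * n for word, n in counts.items())
    return weighted / total
```

  • When every sub-candidate word occurs exactly once, this reduces to the plain average of the sub-importance values.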
  • the term is generated only for the candidate words satisfying the predetermined importance condition, and the term generation efficiency is improved.
  • Fig. 7 is a block diagram showing the structure of a term generating apparatus for generating term information according to still another preferred embodiment of the present invention.
  • the term generating apparatus according to the present embodiment includes a first obtaining means 1, a second obtaining means 2, a first determining means 3, a first generating means 4, a navigation obtaining means 8, and a second generating means 9.
  • The first obtaining device 1, the second obtaining device 2, the first determining device 3, and the first generating device 4 have been described in detail in the embodiment shown in FIG. 5; that description is incorporated herein by reference and is not repeated.
  • the navigation acquisition device 8 acquires webpage navigation information of one or more websites.
  • The one or more websites may be one or more manually designated websites having a certain similarity, or may be one or more websites having a certain similarity that are determined by performing cluster analysis on the web page content of a large number of websites.
  • the webpage navigation information includes, but is not limited to, information that provides a prompt for the user to browse the webpage based on the webpage column structure in the website.
  • the second generating means 9 generates multi-level classification index information based on the obtained one or more webpage navigation information, wherein each of the multi-level classification indexes is associated with each other according to a predetermined topology.
  • the manner in which the second generating device 9 generates the multi-level classification index information according to the obtained one or more webpage navigation information includes, but is not limited to, any one of the following:
  • The second generating means 9 directly converts the obtained web page navigation information into a multi-level classification index.
  • the second generating means 9 uses the respective columns in the navigation column of the website as a classification index, and sequentially stores the belonging relationship between the respective columns as the belonging relationship between the respective classification indexes to generate a multi-level classification index.
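  • Converting navigation columns into a multi-level classification index can be sketched as building a nested mapping. The representation of navigation information as column paths, the function name, and the sample column names other than "Domestic tour" are hypothetical.

```python
def build_classification_index(column_paths):
    """Build a tree of classification indexes from navigation column paths.

    Each path lists column names from the top-level column down to a
    sub-column. The result is a nested dict in which every key is a
    classification index and its value holds the subordinate indexes,
    so the belonging relationship between columns is preserved.
    """
    root = {}
    for path in column_paths:
        node = root
        for column in path:
            node = node.setdefault(column, {})
    return root


# Hypothetical navigation bar of a travel website.
navigation = [
    ["Travel", "Domestic tour"],
    ["Travel", "Outbound tour"],
    ["Hotels"],
]
index_tree = build_classification_index(navigation)
```

  • A nested dict keeps the sketch short; a real system would likely attach classification related web pages and feature information to each node.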
  • The second generating device 9 selects and merges the web page navigation information of the plurality of websites, and generates the multi-level classification index information based on the selected and merged result.
  • The second generating device 9 takes one or more columns in the navigation bars of the plurality of websites as classification indexes, and selects the association relationship between the columns in the navigation bar of one of the websites as the reference for the belonging relationship between the resulting classification indexes, to generate a multi-level classification index.
  • the term generating device further includes a fourth acquiring device (not shown) and a first feature determining device (not shown).
  • The fourth obtaining means acquires the classification related web pages respectively corresponding to each classification index in the multi-level classification index information, based on the web page navigation information of the one or more websites corresponding to the multi-level classification index information.
  • The fourth obtaining device determines, according to the web page navigation information of the one or more websites corresponding to the multi-level classification index information, the part of that navigation information corresponding to each classification index, and acquires at least one site web page corresponding to that partial navigation information in the one or more websites as the classification related web page corresponding to the classification index.
  • The first feature determining means determines the classification feature information respectively corresponding to the respective classification indexes based on the classification related web pages corresponding to those classification indexes.
  • The manner of determining the classification feature information from the classification related web pages is the same as or similar to the manner of determining the feature information corresponding to the candidate word from the one or more search result web pages, and is not described again here.
  • the first determining means 3 determines the classification index corresponding to the candidate words based on the feature information of the candidate words and the classification feature information of the respective classification indexes.
  • The first determining device 3 compares the feature information of the candidate word with the classification feature information of each classification index, and selects a classification index for which the similarity between the classification feature information and the feature information of the candidate word satisfies a predetermined similarity condition as the classification index corresponding to the candidate word.
  • the predetermined similarity condition includes that the similarity satisfies a predetermined similarity threshold.
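  • The comparison against a similarity threshold can be sketched as follows. The text does not name a similarity measure, so cosine similarity over keyword-weight vectors is an assumption, as are the function names and the inclusive threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse keyword-weight dicts."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_indexes(word_features, index_features, threshold):
    """Return (index name, similarity) pairs that reach the threshold.

    `index_features` maps each classification index name to its
    classification feature information. Results are sorted from most
    to least similar; ties are broken by name.
    """
    scored = []
    for name, features in index_features.items():
        score = cosine_similarity(word_features, features)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

  • Sorting the matches makes it easy to keep either every index above the threshold or only the best one, depending on how the similarity condition is defined.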
  • the predetermined topology includes multiple levels.
  • the predetermined topology structure comprises a multi-level tree structure, and the adjacent two levels of classification indexes are affiliation relationships.
  • the comparison obtaining means compares the feature information of the candidate words with the classification feature information of the respective classification indexes to obtain a classification index whose classification feature information is similar to the feature information of the candidate words.
  • The comparison obtaining means compares the feature information of the candidate word with the classification feature information of the respective classification indexes one by one, in a predetermined traversal order based on the predetermined topology, to obtain a classification index whose classification feature information is similar to the feature information of the candidate word.
  • A classification index that has not yet been traversed is acquired at random, and its classification feature information is compared with the feature information of the candidate word, to obtain a classification index whose classification feature information is similar to the feature information of the candidate word.
  • The classification indexes that are leaf nodes are obtained first, and the classification feature information of the classification indexes of that layer is compared with the feature information of the candidate word; when no classification index similar to the feature information of the candidate word is found among the leaf nodes, the classification indexes of the layer above the leaf nodes are acquired and their classification feature information is compared with the feature information of the candidate word, and so on layer by layer until a classification index similar to the feature information of the candidate word is obtained.
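  • The layer-by-layer traversal from the leaf nodes upward can be sketched as follows. The representation of the topology as a list of layers and the `is_similar` callback are assumptions; the callback stands in for whatever similarity condition is used.

```python
def find_similar_bottom_up(layers, is_similar):
    """Search classification indexes from the bottom layer upward.

    `layers` is a list of layers, top layer first, each a list of
    classification index names. `is_similar` is a callable that returns
    True when an index is similar to the candidate word. The function
    returns the similar indexes of the deepest layer that contains any,
    or an empty list when no layer contains a similar index.
    """
    for layer in reversed(layers):
        hits = [name for name in layer if is_similar(name)]
        if hits:
            return hits
    return []
```

  • Stopping at the first layer with a hit means the most specific matching classification index is preferred over its ancestors.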
  • the first classification determining means uses the underlying classification index as the classification index corresponding to the candidate word.
  • the first classification determining apparatus determines whether the obtained classification index is an underlying classification index, and when the obtained classification index includes an underlying classification index, the first classification determining apparatus uses the underlying classification index as a classification corresponding to the candidate words. index.
  • The first determining device 3 further includes a third generating device (not shown) and a second classification determining device (not shown).
  • When the obtained classification index does not include an underlying classification index, the third generating means generates, based on the candidate word and the one or more classification related web pages corresponding to the lowest-level classification index obtained, a subordinate classification index of that lowest-level classification index.
  • The manner in which the third generating device generates the subordinate classification index of the lowest-level classification index based on the candidate word and the one or more classification related web pages corresponding to that classification index includes, but is not limited to, any one of the following:
  • The third generating means generates, based on the candidate word, the name of a subordinate classification index belonging to the classification index obtained by the foregoing first classification determining means, and determines the classification related web pages corresponding to the subordinate classification index based on the search result pages corresponding to the candidate word and the classification related pages corresponding to the obtained classification index.
  • The third generating device queries the one or more site web pages corresponding to the classification index obtained by the foregoing first classification determining apparatus to acquire at least one web page related to the candidate word, determines the central word corresponding to the at least one web page as the name of a subordinate classification index of that classification index, and uses the at least one web page as the classification related web page corresponding to the subordinate classification index.
  • the second classification determining means sets the generated underlying classification index as a classification index corresponding to the candidate word.
  • the multi-level classification index is established by acquiring website navigation information of one or more websites, so that the classification index system of the terms is similar to the system in actual use, which is beneficial to more comprehensively mining professional websites.
  • FIG. 8 is a block diagram showing the structure of a term generating apparatus for generating term information according to still another preferred embodiment of the present invention.
  • The term generating apparatus includes a first obtaining means 1, a second obtaining means 2, a first determining means 3, a first generating means 4, a first web page obtaining means 10, a second feature determining means 11, a third classification determining means 12, and a providing means 13.
  • The first obtaining device 1, the second obtaining device 2, the first determining device 3, and the first generating device 4 have been described in detail in the embodiment shown in FIG. 5; that description is incorporated herein by reference and is not repeated.
  • the first web page obtaining means 10 acquires one or more web pages of the candidate website.
  • the manner in which the first webpage obtaining apparatus 10 determines the candidate website includes, but is not limited to, any one of the following:
  • the first webpage obtaining device 10 acquires a manually designated website as a candidate website
  • The first web page obtaining device 10 compares crawled web pages with the classification related web pages corresponding to each classification index in the multi-level classification index information, to obtain, as a candidate website, a website whose web pages are similar to the classification related web pages corresponding to the classification indexes.
  • the second feature determining means 11 determines site feature information of the candidate website based on one or more pages of the candidate website.
  • The manner in which the second feature determining device 11 determines the site feature information of the candidate website according to one or more web pages of the candidate website is the same as or similar to the manner in which the second determining device in the embodiment shown in FIG. 5 determines the feature information corresponding to the candidate word according to the one or more search result web pages, and is not repeated here.
  • the third category determining means 12 compares the site feature information of the candidate website with the classification feature information of each category index to determine one or more category indexes corresponding to the candidate website.
  • The manner in which the third category determining device 12 compares the site feature information of the candidate website with the classification feature information of each classification index to determine one or more classification indexes corresponding to the candidate website is the same as or similar to the manner in which, in the embodiment described above with reference to FIG. 7, the comparison obtaining device compares the feature information of the candidate word with the classification feature information of the respective classification indexes to obtain a classification index whose classification feature information is similar to the feature information of the candidate word, and is not repeated here.
  • the providing device 13 provides the candidate user corresponding to the candidate website with one or more candidate words respectively corresponding to the one or more classification links.
  • the term generating apparatus further includes a second webpage obtaining device (not shown) and a first update device (not shown).
  • the second webpage obtaining device acquires, according to one or more classification indexes corresponding to the candidate website, one or more candidate webpages in the candidate website that respectively correspond to the one or more classification indexes.
  • the manner in which the second webpage obtaining device acquires these candidate webpages includes, but is not limited to, any one of the following:
  • the second webpage obtaining device acquires the classification related webpages of the one or more classification indexes and compares them with the site webpages of the candidate website, so as to obtain one or more site webpages similar to the classification related webpages, and uses them as candidate webpages of the classification index to which those classification related webpages correspond.
  • the second webpage obtaining device acquires, from the candidate website, one or more candidate webpages that are respectively similar to the classification feature information of the one or more classification indexes, according to that classification feature information.
  • the first update device determines or updates the classification related webpages corresponding to the respective classification indexes based on one or more candidate webpages of the candidate website corresponding to the respective classification indexes.
  • the first update device adds each determined candidate webpage, as a classification related webpage of its classification index, to the classification related webpage library corresponding to that classification index.
  • the first update device updates the term information of the candidate words corresponding to the respective classification indexes based on the updated classification related webpages corresponding to the respective classification indexes.
  • for the one or more candidate words belonging to each classification index, the first update device updates the term content of each candidate word by using the updated classification related webpages of that classification index.
  • the manner in which the first update device updates the term content of each candidate word by using the updated classification related webpages of the classification index is the same as or similar to the manner in which the first generation device in the embodiment shown in FIG. 5 determines the term information corresponding to the candidate word according to at least one classification related webpage corresponding to the classification index information, and is not repeated here. According to the solution of this embodiment, the term information is updated automatically from the content of the candidate website, so that the term content can be updated as soon as possible and the update efficiency is improved.
  • the software program of the present invention can be executed by a processor to implement the steps or functions described above.
  • the software program (including related data structures) of the present invention can be stored in a computer readable recording medium such as a RAM memory, a magnetic or optical drive or a floppy disk and the like.
  • some of the steps or functions of the present invention may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform various functions or steps.
  • a portion of the present invention can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide a method and/or solution in accordance with the present invention.
  • the program instructions for invoking the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted by a data stream in a broadcast or other signal bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions.
  • an embodiment in accordance with the present invention includes a device including a memory for storing computer program instructions and a processor for executing program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to operate based on the foregoing methods and/or technical solutions in accordance with various embodiments of the present invention.

Abstract

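A method and an apparatus for generating entry information are provided. The method comprises: acquiring a candidate word; performing a search based on the candidate word to acquire feature information of the candidate word; determining, according to the feature information of the candidate word, a classification index corresponding to the candidate word in multi-level classification index information, wherein the classification index corresponds to at least one classification related webpage; and generating entry information corresponding to the candidate word according to the at least one classification related webpage corresponding to the classification index information. The advantage is that content related to an entry can be mined comprehensively from professional websites related to the entry and the entry information can be generated automatically, which improves the efficiency of generating entry information and yields more comprehensive and complete entry information.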

Description

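A Method and Apparatus for Generating Entry Information
Technical Field
The present invention relates to the field of computer technologies, and in particular to a method and an apparatus for generating entry information.
Background
In the prior art, the entry information of an encyclopedia entry can only be generated by users filling in content manually. This approach is inefficient, and the information cannot be updated in a timely manner. Another approach generates entry information automatically from the webpage content obtained by searching for the relevant entry. However, the webpages obtained in this way are of many different types and their content is unsystematic, so the generated entry information is incomplete, and the webpage content of professional websites related to the entry often cannot be used effectively.
Summary of the Invention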
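The object of the present invention is to provide a method and an apparatus for generating entry information.
According to one aspect of the present invention, a method for generating entry information is provided, wherein the method comprises the following steps:
a. acquiring a candidate word;
b. performing a search based on the candidate word to acquire feature information of the candidate word;
c. determining, according to the feature information of the candidate word, a classification index corresponding to the candidate word in multi-level classification index information, wherein the classification index corresponds to at least one classification related webpage;
d. generating entry information corresponding to the candidate word according to the at least one classification related webpage corresponding to the classification index information.
According to another aspect of the present invention, an entry generating apparatus for generating entry information is provided, wherein the entry generating apparatus comprises:
a first obtaining device, configured to acquire a candidate word;
a second obtaining device, configured to perform a search based on the candidate word to acquire feature information of the candidate word;
a first determining device, configured to determine, according to the feature information of the candidate word, a classification index corresponding to the candidate word in multi-level classification index information, wherein the classification index corresponds to at least one classification related webpage;
a first generation device, configured to generate entry information corresponding to the candidate word according to the at least one classification related webpage corresponding to the classification index information.
The advantage of the present invention is that content related to an entry can be mined from professional websites related to the entry and the entry information can be generated automatically, which improves the efficiency of generating entry information and yields more comprehensive and complete entry information.
Brief Description of the Drawings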
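Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
FIG. 1 is a flowchart of a method for generating entry information according to one aspect of the present invention;
FIG. 2 is a flowchart of a method for generating entry information according to a preferred embodiment of the present invention;
FIG. 3 is a flowchart of a method for generating entry information according to another preferred embodiment of the present invention;
FIG. 4 is a flowchart of a method for generating entry information according to yet another preferred embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an entry generating apparatus for generating entry information according to one aspect of the present invention;
FIG. 6 is a schematic structural diagram of an entry generating apparatus for generating entry information according to a preferred embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an entry generating apparatus for generating entry information according to another preferred embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an entry generating apparatus for generating entry information according to yet another preferred embodiment of the present invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed Description of the Embodiments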
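The present invention is described in further detail below with reference to the accompanying drawings.
FIG. 1 shows a flowchart of a method for generating entry information according to one aspect of the present invention. The method according to the present invention includes step S1, step S2, step S3 and step S4.
The method according to the present invention is implemented by a computer device. The computer device includes an electronic device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The computer device includes a network device and/or a user device. The user device includes, but is not limited to, any electronic product capable of human-computer interaction with a user through a keyboard, a mouse, a remote control, a touchpad, a voice-control device or the like, for example a personal computer, a tablet computer, a smartphone, a PDA, a game console or an IPTV. The network in which the user device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN, and the like.
It should be noted that the user devices and networks above are only examples. Other existing or future user devices and networks, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.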
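Referring to FIG. 1, in step S1, the computer device acquires a candidate word.
Specifically, the manner of acquiring the candidate word includes, but is not limited to, any one of the following:
1) acquiring a query sequence input by a user in real time and using it as the candidate word;
2) selecting one of a plurality of pre-acquired query sequences as the candidate word.
Next, in step S2, the computer device performs a search based on the candidate word to acquire feature information of the candidate word.
The feature information includes one or more items of text information. The text information includes, but is not limited to, any of the following:
a) word information;
b) paragraph language information.
Preferably, the feature information includes one or more items of text information and weight information of each item of text information.
Specifically, the manner in which the computer device performs a search based on the candidate word to acquire the feature information of the candidate word includes, but is not limited to, any of the following:
1) The computer device searches a lexicon containing a plurality of candidate words and their corresponding feature information, to obtain the feature information corresponding to the candidate word acquired in step S1.
2) The computer device performs a search based on the candidate word through a first predetermined search engine, to acquire one or more search result webpages corresponding to the candidate word; the computer device then determines the feature information corresponding to the candidate word according to the one or more search result webpages.
The first predetermined search engine includes, but is not limited to, any search engine that can perform a search based on a candidate word and obtain one or more search result webpages.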
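The manner in which the computer device determines the feature information corresponding to the candidate word according to the one or more search result webpages includes, but is not limited to, any of the following:
a) acquiring at least one keyword contained in the one or more search result webpages; acquiring weight information of each of the at least one keyword; and determining the feature information corresponding to the candidate word based on the obtained keywords and their corresponding weight information.
The weight information is determined according to at least any one of the following:
I) the frequency of occurrence of the keyword in the one or more search result webpages;
II) the number of occurrences of the keyword in the one or more search result webpages;
III) region information of where the keyword appears in the one or more search result webpages, for example in the title part or in the content part of a webpage.
Preferably, the weight information is determined based on the term frequency-inverse document frequency (TF-IDF) value of each keyword in the one or more search result webpages.
Specifically, the computer device performs word segmentation on the webpage content of the one or more search result webpages to obtain at least one keyword, and computes the weight information of each of the at least one keyword; then, according to the obtained keywords and their weight information, it selects one or more of the keywords as the feature information corresponding to the candidate word.
Preferably, the computer device selects one or more search result webpages from all search result webpages corresponding to the candidate word, and determines the feature information corresponding to the candidate word based on the selected search result webpages.
According to a first example of the present invention, the candidate word obtained by the computer device in step S1 is "Maldives", and the computer device searches for "Maldives" through a predetermined search engine, such as the Baidu search engine, and obtains a plurality of search result webpages. The computer device selects the ten top-ranked search result webpages, web1 to web10, as the one or more search result webpages corresponding to the candidate word. The computer device then performs word segmentation on the webpage content of the ten selected search result webpages to obtain a plurality of keywords, computes the TF-IDF value of each keyword with respect to the ten search result webpages, and uses the obtained TF-IDF values as the weight information of the keywords. The computer device then sorts the keywords by TF-IDF value, selects the twenty top-ranked keywords, and uses these twenty keywords and their respective TF-IDF values as the feature information of the candidate word "Maldives".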
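b) determining, through a predetermined topic determination model and according to the webpage content of each of the one or more search result webpages, topic related information corresponding to the one or more search result webpages; and then determining the feature information corresponding to the candidate word based on the determined topic related information.
The predetermined topic determination model performs operations such as data mining on given text information through a predetermined model, to obtain topic related information corresponding to the text information. Examples include the Latent Dirichlet Allocation (LDA) model, the Probabilistic Latent Semantic Analysis (PLSA) model and the Labeled Latent Dirichlet Allocation (Labeled LDA) model.
The topic related information includes information characterizing one or more topics of the text information, for example a plurality of keywords characterizing the topics of the text information.
Preferably, the topic related information further includes information characterizing the weight of the one or more topics in the text, for example keyword weights corresponding to the plurality of keywords that characterize the topics of the text information.
Those skilled in the art should be able to determine, according to actual circumstances and requirements, which topic model to use and how to obtain one or more items of topic related information through the topic model; this is not elaborated here.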
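The keyword-weighting variant described above (option a, as used in the "Maldives" example) can be sketched as follows. This is only a minimal illustration, assuming the pages have already been segmented into tokens. The function name `top_keywords`, the use of a separate background collection for document frequencies, and the smoothed IDF formula are assumptions made for the sketch and are not taken from the specification.

```python
import math
from collections import Counter

def top_keywords(result_pages, background_pages, k=20):
    """Return the k highest-weighted (keyword, weight) pairs.

    result_pages / background_pages: lists of token lists (already segmented).
    TF is counted over the selected result pages; IDF is estimated over the
    background collection with add-one smoothing (an assumed formula).
    """
    tf = Counter(tok for page in result_pages for tok in page)
    total = sum(tf.values())
    n_docs = len(background_pages)
    df = Counter(tok for page in background_pages for tok in set(page))
    weights = {
        tok: (count / total) * math.log((1 + n_docs) / (1 + df[tok]))
        for tok, count in tf.items()
    }
    # Highest weight first; ties are broken alphabetically for determinism.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

The returned list of keyword and weight pairs plays the role of the feature information of the candidate word; with `k=20` it mirrors the "top twenty keywords" choice in the first example.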
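Next, in step S3, the computer device determines, according to the feature information of the candidate word, a classification index corresponding to the candidate word in the multi-level classification index information.
The multi-level classification index information includes a plurality of classification indexes associated with one another based on a predetermined topology, wherein each classification index corresponds to at least one classification related webpage.
The manner of determining the multi-level classification index information is detailed later in the embodiment described with reference to FIG. 3, which is incorporated herein by reference, and is not repeated here.
Specifically, the computer device obtains the similarity between the feature information of the candidate word and the at least one classification related webpage of each classification index in the multi-level classification index information, and determines the classification index corresponding to the candidate word based on the similarity.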
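The similarity-based selection in step S3 can be illustrated with a short sketch. The specification does not name a similarity measure, so cosine similarity over sparse keyword-weight vectors is an assumption here, as are the function names `cosine` and `best_index` and the idea that each classification index has already been reduced to one weight vector derived from its classification related webpages.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_index(word_features, index_features):
    """Return (index_name, similarity) for the most similar classification index."""
    return max(
        ((name, cosine(word_features, feats)) for name, feats in index_features.items()),
        key=lambda pair: pair[1],
    )
```

A caller would pass the feature information of the candidate word as `word_features` and one vector per classification index as `index_features`; picking the single best match is one possible policy, and a threshold-based policy would fit the text equally well.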
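Next, in step S4, the computer device determines the entry information corresponding to the candidate word according to the at least one classification related webpage corresponding to the classification index information.
Specifically, the computer device acquires webpage content related to the candidate word from the at least one classification related webpage corresponding to the classification index, to generate entry information that belongs to the classification index and corresponds to the candidate word.
The manner in which the computer device acquires content information related to the candidate word from the at least one classification related webpage includes the following:
According to the candidate word and its feature information, the computer device mines, from the at least one classification related webpage, the webpage content corresponding to the candidate word and its feature information, and uses it as the content information of the entry information corresponding to the candidate word.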
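Continuing with the first example above, the multi-level classification index information includes the classification indexes shown in Table 1 below, which are associated based on a predetermined tree topology:
Table 1 (image imgf000009_0001 in the original publication; not reproduced here)
Each classification index corresponds to a plurality of classification related webpages. In step S3 the computer device determines that the classification index corresponding to the candidate word "Maldives" is "outbound travel". The computer device then acquires, from the plurality of classification related webpages corresponding to the classification index "outbound travel", the webpage content related to the candidate word "Maldives" and its feature information, and uses it as the content of the entry information corresponding to the candidate word "Maldives", so as to generate entry information that belongs to the classification index "outbound travel" and corresponds to the candidate word "Maldives".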
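Preferably, when entry information that belongs to the classification index and corresponds to the candidate word already exists, the computer device acquires content information related to the candidate word from the at least one classification related webpage corresponding to the classification index, to update the entry information corresponding to the candidate word.
According to the method of the present invention, the content of the entry information can be obtained automatically from classification related webpages that have a high similarity to the candidate word, which greatly improves the efficiency of generating and updating entry information. Moreover, the content of the classification related webpages can be mined and used more fully.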
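FIG. 2 shows a flowchart of a method for generating entry information according to a preferred embodiment of the present invention. The method according to this embodiment includes steps S1 to S4, step S5, step S6 and step S7.
Steps S1 to S4 have been detailed in the embodiment described with reference to FIG. 1, which is incorporated herein by reference, and are not repeated.
In step S5, the computer device acquires one or more items of network publishing information corresponding to the candidate word.
The network publishing information includes the various types of information that are published on the Internet. Preferably, the network publishing information includes advertisements.
The manner in which the computer device acquires the one or more items of network publishing information corresponding to the candidate word includes, but is not limited to, any of the following:
1) The computer device queries the candidate word in a second predetermined search engine to acquire one or more items of network publishing information corresponding to the candidate word.
The second predetermined search engine includes, but is not limited to, any search engine that can perform a search based on a candidate word and obtain one or more items of network publishing information.
Preferably, the second predetermined search engine is the same search engine as the first predetermined search engine described in the embodiment with reference to FIG. 1.
2) The computer device acquires one or more items of network publishing information corresponding to the candidate word through a predetermined correspondence between candidate words and network publishing information.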
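Next, in step S6, the computer device determines importance information of the candidate word according to the obtained one or more items of network publishing information.
Specifically, the manner in which the computer device determines the importance information of the candidate word according to the obtained network publishing information includes, but is not limited to, any of the following:
1) The computer device computes weight information of the candidate word with respect to the one or more items of network publishing information.
For example, the computer device computes the TF-IDF value of the candidate word with respect to its corresponding advertisements and uses it as the importance information of the candidate word.
2) The computer device counts the number of the one or more items of network publishing information and uses it as the importance information of the candidate word.
3) The computer device acquires usage information of the one or more items of network publishing information and determines the importance information of the candidate word according to the obtained usage information. The usage information of the network publishing information includes, but is not limited to, at least any one of the following:
a) the number of times the network publishing information has been shown;
b) the number of times the network publishing information has been clicked, and the like.
For example, the computer device counts the total number of clicks on all advertisements corresponding to the candidate word and uses it as the importance information of the candidate word; as another example, the computer device computes the average number of clicks on the advertisements corresponding to the candidate word and uses it as the importance information of the candidate word.
Next, in step S7, the computer device judges whether the importance information of the candidate word satisfies a predetermined importance condition.
The predetermined importance condition includes a predetermined importance threshold.
Specifically, the computer device judges whether the importance information of the candidate word satisfies the predetermined threshold.
Next, according to the method of this embodiment, in step S2, the computer device acquires the feature information of the candidate word when the importance information of the candidate word satisfies the predetermined importance condition.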
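Steps S6 and S7 amount to computing a score and gating on it. The following sketch shows the click-based variants (total and average clicks) together with the threshold check. The record shape (`{"clicks": ...}`), the function names, and the reading of "satisfies the threshold" as "greater than or equal to" are assumptions for illustration only.

```python
def importance_from_clicks(ads, mode="total"):
    """Importance of a candidate word from its advertisements' click counts.

    ads: list of dicts with a 'clicks' count (hypothetical record shape).
    mode: 'total' sums all clicks; any other value returns the average.
    """
    if not ads:
        return 0.0
    total = sum(ad["clicks"] for ad in ads)
    return float(total) if mode == "total" else total / len(ads)

def passes_gate(importance, threshold):
    """True when the importance satisfies the predetermined threshold (assumed >=)."""
    return importance >= threshold
```

Only candidate words for which `passes_gate` returns `True` would proceed to step S2, which is how this embodiment avoids spending search and classification work on unimportant words.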
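As one of the preferred solutions of this embodiment, step S5 further includes step S501 (not shown) and step S502 (not shown), and step S6 further includes step S601 (not shown) and step S602 (not shown).
In step S501, the computer device performs word segmentation on the candidate word to obtain a plurality of sub-candidate words.
In step S502, the computer device performs a search based on each sub-candidate word through the second predetermined search engine, to acquire the network publishing information corresponding to each sub-candidate word.
The manner in which the computer device performs this search is the same as or similar to the manner, described above, in which the computer device queries the candidate word in the second predetermined search engine to acquire one or more items of network publishing information corresponding to the candidate word, and is therefore not repeated.
Next, in step S601, the computer device determines sub-importance information of each sub-candidate word based on the network publishing information corresponding to that sub-candidate word.
The manner in which the computer device determines the sub-importance information is the same as or similar to the manner, described above, in which the computer device determines the importance information of the candidate word according to the obtained one or more items of network publishing information, and is therefore not repeated.
In step S602, the computer device determines the importance information of the candidate word based on the sub-importance information of the sub-candidate words.
Specifically, the computer device determines the importance information of the candidate word from the sub-importance information of the sub-candidate words based on a predetermined statistical rule.
Preferably, the manner of doing so includes, but is not limited to, any of the following:
1) The computer device determines average importance information from the sub-importance information of the sub-candidate words and uses it as the importance information of the candidate word.
2) The computer device obtains a weight value of each sub-candidate word relative to the candidate word to which it belongs, and determines the importance information of the candidate word based on the sub-importance information and the weight values of the sub-candidate words.
For example, the weight value of each sub-candidate word is determined based on the number of times it appears in the candidate word to which it belongs, and the importance information of the candidate word is determined based on the sub-importance information and the weight values of the sub-candidate words.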
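The two statistical rules for step S602 can be sketched in one small function. The name `combine_importance` is hypothetical, and normalizing by the sum of the weights in the weighted case is an assumption; the text only says that the weights and sub-importance values are combined.

```python
def combine_importance(sub_importance, weights=None):
    """Combine sub-candidate importance values into one candidate importance.

    sub_importance: {sub_word: importance}
    weights: optional {sub_word: weight}; without it, a plain average is used.
    """
    if not sub_importance:
        return 0.0
    if weights is None:
        return sum(sub_importance.values()) / len(sub_importance)
    total_w = sum(weights.get(w, 0.0) for w in sub_importance)
    if total_w == 0:
        return 0.0
    # Weighted average; sub-words without a weight contribute nothing.
    return sum(v * weights.get(w, 0.0) for w, v in sub_importance.items()) / total_w
```

With the occurrence-count weighting from the example, `weights` would simply map each sub-candidate word to the number of times it appears in the candidate word.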
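According to the method of this embodiment, entries are generated only for candidate words that satisfy the predetermined importance condition, which improves the efficiency of entry generation.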
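FIG. 3 shows a flowchart of a method for generating entry information according to another preferred embodiment of the present invention. The method according to this embodiment includes steps S1 to S4, step S8 and step S9.
Steps S1 to S4 have been detailed in the embodiment described with reference to FIG. 1, which is incorporated herein by reference, and are not repeated.
In step S8, the computer device acquires webpage navigation information of one or more websites.
The one or more websites may be manually designated websites having a certain degree of similarity, or websites having a certain degree of similarity that are determined by performing cluster analysis on the webpage content of a large number of websites.
The webpage navigation information includes, but is not limited to, information that is based on the column structure of the webpages in a website and provides prompts for users browsing the webpages.
In step S9, the computer device generates multi-level classification index information according to the obtained webpage navigation information, wherein the classification indexes in the multi-level classification index are associated with one another according to a predetermined topology.
Specifically, the manner in which the computer device generates the multi-level classification index information according to the obtained webpage navigation information includes, but is not limited to, any of the following:
1) directly converting the obtained webpage navigation information into a multi-level classification index.
For example, each column in the navigation bar of a website is used as a classification index, and the membership relations between the columns are saved in turn as the membership relations between the classification indexes, to generate the multi-level classification index.
2) selecting and merging the webpage navigation information of a plurality of websites, and generating the multi-level classification index information based on the result of the selection and merging.
For example, one or more columns contained in common in the navigation bars of the plurality of websites are used as classification indexes, and the membership relations between the columns in the navigation bar of one of the websites are selected as a reference for the membership relations between the obtained classification indexes, to generate the multi-level classification index.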
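Both ways of building the multi-level classification index can be illustrated with one sketch. Each site's navigation is modeled as a child-to-parent mapping, which is an assumed representation; the function name `common_index_tree` and the rule of re-attaching a column to its nearest surviving ancestor are likewise assumptions, since the text only says that one site's relations serve as a reference.

```python
def common_index_tree(site_navs, reference=0):
    """Build a {column: parent_or_None} index tree from site navigation maps.

    site_navs: list of {child_column: parent_column or None}, one per site.
    Keeps only columns present in every site's navigation and takes the
    parent/child relations from the reference site.
    """
    common = set(site_navs[0])
    for nav in site_navs[1:]:
        common &= set(nav)
    ref = site_navs[reference]
    tree = {}
    for column in sorted(common):
        parent = ref[column]
        # Climb until an ancestor that also survived the merge (or the root).
        while parent is not None and parent not in common:
            parent = ref.get(parent)
        tree[column] = parent
    return tree
```

Called with a single site, the function returns that site's navigation unchanged, which corresponds to option 1; called with several sites, it corresponds to option 2.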
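As one of the preferred solutions of this embodiment, the method further includes step S10 (not shown) and step S11 (not shown), and step S3 further includes step S301 (not shown).
In step S10, the computer device acquires, based on the webpage navigation information of the one or more websites corresponding to the multi-level classification index information, the classification related webpages respectively corresponding to the classification indexes in the multi-level classification index information.
Specifically, based on the webpage navigation information of the one or more websites corresponding to the multi-level classification index information, the computer device determines the part of that navigation information that corresponds to each classification index, and acquires at least one site webpage of the one or more websites that corresponds to this part of the navigation information as the classification related webpage of the classification index.
Next, in step S11, the computer device determines the classification feature information of each classification index based on the classification related webpages corresponding to that classification index.
The manner in which the computer device determines the classification feature information is the same as or similar to the manner in which, in step S2 of the embodiment described with reference to FIG. 1, the computer device determines the feature information corresponding to the candidate word according to the one or more search result webpages, and is not repeated here.
Next, in step S301, the computer device determines the classification index corresponding to the candidate word based on the feature information of the candidate word and the classification feature information of each classification index.
Specifically, the computer device compares the feature information of the candidate word with the classification feature information of each classification index, and selects a classification index whose classification feature information has a similarity to the feature information of the candidate word that satisfies a predetermined similarity condition as the classification index corresponding to the candidate word.
The predetermined similarity condition includes the similarity satisfying a predetermined similarity threshold.
As one of the preferred solutions of this embodiment, the predetermined topology includes a multi-level topology in which the classification indexes of two adjacent levels are in a membership relation, and step S3 further includes step S302 (not shown) and step S303 (not shown).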
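Preferably, the predetermined topology includes a multi-level tree structure in which the classification indexes of two adjacent levels are in a membership relation.
In step S302, the computer device compares the feature information of the candidate word with the classification feature information of each classification index, to obtain the classification indexes whose classification feature information is similar to the feature information of the candidate word.
Specifically, according to the predetermined topology and in a predetermined traversal order, the computer device compares the feature information of the candidate word with the classification feature information of the classification indexes one by one, to obtain the classification indexes whose classification feature information is similar to the feature information of the candidate word.
For example, when the predetermined topology is a tree structure and the predetermined traversal order is random, a classification index that has not yet been traversed is obtained at random, and its classification feature information is compared with the feature information of the candidate word.
As another example, when the predetermined topology is a tree structure and the predetermined traversal order is bottom-up from the leaf nodes, layer by layer, the classification indexes that are leaf nodes are obtained first and their classification feature information is compared with the feature information of the candidate word. When no classification index similar to the feature information of the candidate word is found among the leaf nodes, the classification indexes of the nodes one layer above the leaf nodes are obtained and compared, and so on upward, layer by layer, until a classification index similar to the feature information of the candidate word is obtained.
In step S303, when the obtained classification indexes include a bottom-level classification index, the computer device uses that bottom-level classification index as the classification index corresponding to the candidate word.
Specifically, the computer device judges whether an obtained classification index is a bottom-level classification index, and when the obtained classification indexes include a bottom-level classification index, the computer device uses that bottom-level classification index as the classification index corresponding to the candidate word.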
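The bottom-up traversal of step S302 and the bottom-level check of step S303 can be sketched as follows. The layer-by-layer list representation of the tree, the function name `find_index_bottom_up`, and the use of a numeric threshold for "similar" are assumptions for illustration.

```python
def find_index_bottom_up(levels, similarity, threshold):
    """Search the index tree from the leaf layer upward.

    levels: list of lists of index names, ordered from the leaf layer upward.
    similarity: callable(index_name) -> float for the candidate word.
    Returns (matches, is_bottom_level); matches is empty if no layer qualifies.
    """
    for depth, level in enumerate(levels):
        matches = [name for name in level if similarity(name) >= threshold]
        if matches:
            # Stop at the first (lowest) layer that contains a similar index.
            return matches, depth == 0
    return [], False
```

When `is_bottom_level` is `True`, the match can be used directly as the classification index of the candidate word; otherwise the match is only the lowest similar level, and a lower-level index still has to be chosen or created.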
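Preferably, according to the method of this solution, step S3 further includes step S304 (not shown) and step S305 (not shown).
In step S304, when the obtained classification indexes do not include a bottom-level index node, the computer device generates a lower-level classification index under the lowest-level classification index among them, based on the candidate word and the one or more classification related webpages corresponding to that lowest-level classification index.
Specifically, the manner in which the computer device generates this lower-level classification index includes, but is not limited to, any of the following:
1) The computer device generates, based on the candidate word, the name of a lower-level classification index belonging to the classification index obtained in step S302, and determines the classification related webpages corresponding to the lower-level classification index based on the search result pages corresponding to the candidate word and the classification related pages corresponding to the obtained classification index.
2) Based on the one or more site webpages corresponding to the classification index obtained in step S302, the computer device queries these site webpages to obtain at least one webpage related to the candidate word, determines the central word corresponding to that webpage and uses it as the name of a lower-level classification index of the classification index obtained in step S302, and uses the at least one webpage as the classification related webpage corresponding to the lower-level classification index.
Next, in step S305, the computer device uses the generated bottom-level classification index as the classification index corresponding to the candidate word.
According to the method of this embodiment, the multi-level classification index is built from the website navigation information of one or more websites, so that the classification index system of the entries is close to the systems in actual use, which helps to mine the content information of professional websites more comprehensively. Because the webpage content of these websites can also be used as the classification related webpages of the classification indexes, more systematic and complete entry information can be generated for the candidate word.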
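FIG. 4 shows a flowchart of a method for generating entry information according to yet another preferred embodiment of the present invention. The method according to this embodiment includes steps S1 to S4, step S12, step S13, step S14 and step S15.
Steps S1 to S4 have been detailed in the embodiment described with reference to FIG. 1, which is incorporated herein by reference, and are not repeated.
In step S12, the computer device acquires one or more webpages of a candidate website.
The manner in which the computer device determines the candidate website includes, but is not limited to, any of the following:
1) acquiring a manually designated website as the candidate website;
2) comparing crawled website pages with the webpages corresponding to each classification index in the multi-level classification index information, to obtain a website whose site webpages are similar to the webpages corresponding to the classification indexes.
Next, in step S13, the computer device determines site feature information of the candidate website according to the one or more webpages of the candidate website.
The manner in which the computer device determines the site feature information is the same as or similar to the manner in which, in step S2 of the embodiment described with reference to FIG. 1, the computer device determines the feature information corresponding to the candidate word according to the one or more search result webpages, and is not repeated here.
Next, in step S14, the computer device compares the site feature information of the candidate website with the classification feature information of each classification index, to determine one or more classification indexes corresponding to the candidate website.
The manner in which the computer device performs this comparison is the same as or similar to the manner in which, in step S302 of the embodiment described with reference to FIG. 3, the computer device compares the feature information of the candidate word with the classification feature information of each classification index to obtain the classification indexes whose classification feature information is similar to the feature information of the candidate word, and is not repeated here.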
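Next, in step S15, the computer device provides the candidate user corresponding to the candidate website with one or more candidate words respectively corresponding to the one or more classification indexes.
As a preferred solution of this embodiment, the method according to this embodiment further includes step S16 (not shown), step S17 (not shown) and step S18 (not shown).
In step S16, the computer device acquires, according to the one or more classification indexes corresponding to the candidate website, one or more candidate webpages in the candidate website that respectively correspond to the one or more classification indexes.
The manner in which the computer device acquires these candidate webpages includes, but is not limited to, any of the following:
1) The computer device acquires the classification related webpages of the one or more classification indexes and compares them with the site webpages of the candidate website, to obtain one or more site webpages similar to the classification related webpages, and uses them as candidate webpages of the classification index to which those classification related webpages correspond.
2) The computer device acquires, from the candidate website and according to the classification feature information of the one or more classification indexes, one or more candidate webpages that are respectively similar to that classification feature information.
Next, in step S17, the computer device determines or updates the classification related webpages corresponding to each classification index based on the one or more candidate webpages in the candidate website that correspond to that classification index.
Specifically, the computer device adds each determined candidate webpage, as a classification related webpage of its classification index, to the classification related webpage library corresponding to that classification index.
In step S18, the computer device updates the entry information of the candidate words corresponding to each classification index based on the updated classification related webpages corresponding to that classification index.
Specifically, for the one or more candidate words belonging to each classification index, the entry content of each candidate word is updated by using the updated classification related webpages of that classification index. The manner of doing so is the same as or similar to the manner in which, in step S4 of the embodiment described with reference to FIG. 1, the computer device determines the entry information corresponding to the candidate word according to the at least one classification related webpage corresponding to the classification index information, and is not repeated here.
According to the method of this embodiment, the entry information is updated automatically from the content of the candidate website, so that the entry content can be updated as soon as possible and the update efficiency is improved.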
图 5示意出了根据本发明的一个方面用于生成词条信息的词条生 成装置的结构示意图。 根据本发明的词条生成装置包括第一获取装 置 1、 第二获取装置 2、 第一确定装置 3和第一生成装置 4。
参照图 5 , 第一获取装置 1获取候选词。
具体地, 所述获取候选词的方式包括但不限于以下任一种方式:
1 ) 实时获取用户输入的查询序列, 并将其作为候选词;
2 ) 由预获取的多个查询序列中选择一个作为候选词。
接着, 第二获取装置 2 基于所述候选词进行搜索, 以获取所述候 选词的特征信息。
其中, 所述特征信息包括一项或多项文本信息。 其中, 所述文本 信息包括但不限于以下任一项:
a )词语信息;
b )段落语言信息。
优选地, 所述特征信息包括一项或多项文本信息以及各项文本信 息的权重信息。
具体地, 所述第二获取装置 2基于所述候选词进行搜索, 以获取 所述候选词的特征信息的方式包括但不限于以下任一项:
1 ) 第二获取装置 2在包含多个候选词及其对应的特征信息的词 库中搜索, 以获得与步骤 S1中所获得的候选词对应的特征信息。
2 ) 第二获取装置 2 中的第一搜索装置 (图未示) 的通过第一预定 搜索引擎, 基于所述候选词执行搜索, 以获取与所述候选词对应的一 个或多个搜索结果网页; 接着, 第二获取装置 2中的第二确定装置(图 未示)才艮据所述一个或多个搜索结果网页, 来确定与所述候选词对应的 特征信息。
其中, 所述第一预定搜索引擎包括但不限于可基于候选词执行搜 索并获取一个或多个搜索结果网页的搜索引擎。
其中, 所述第二确定装置根据所述一个或多个搜索结果网页, 来 确定与所述候选词对应的特征信息的方式包括但不限于以下任一项: a ) 第二确定装置中的关键词获取装置 (图未示) 获取所述一 个或多个搜索结果网页中所包含的至少一个关键词; 接着, 第二 确定装置中的权重获取装置 (图未示) 获取所述至少一个关键词 中的各个关键词的权重信息; 接着, 第二确定装置中的第一子确 定装置 (图未示)基于所获得的各个关键词及其相应的权重信息, 来确定与所述候选词对应的特征信息。
其中, 所述权重信息根据以下至少任一项信息来确定: 1 ) 关键词在所述一个或多个搜索结果网页中的出现频率;
II ) 关键词在所述一个或多个搜索结果网页中的出现次数;
III ) 关键词在所述一个或多个搜索结果网页中出现的区域信 息, 例如, 出现在网页标题部分, 或者, 出现在网页内容部分等。
优选地, 所述权重信息基于各个关键词在所述一个或多个搜 索结果网页中的词频反文档频率(TF-IDF, term frequency-inverse document frequency) 值来确定。
具体地, 关键词获取装置对一个或多个搜索结果网页的网页 内容进行切词处理, 以获得至少一个关键词, 接着, 权重获取装 置统计并确定该至少一个关键词中的各个关键词的权重信息, 接 着, 第一子确定装置根据所获得的各个关键词及其权重信息, 由 该至少一个关键词中选择一个或多个关键词作为与候选词对应的 特征信息。
优选地, 计算机设备由与所述候选词对应的所有搜索结果网 页中选择一个或多个搜索结果网页, 并基于该所选择的搜索结果 网页来确定与该候选词对应的特征信息。
根据本发明的第一示例, 第一获取装置 1 获得的候选词包括 "马尔代夫" , 并且, 第一搜索装置通过预定搜索引擎, 如百度 搜索引擎对 "马尔代夫" 进行搜索以获得多个搜索结果网页, 并 选择在搜索结果中排名前十位的搜索结果网页 webl至 webl 0作为 与该候选词对应的一个或多个搜索结果网页。 接着, 关键词获取 装置权重获取装置对所选择的十个搜索结果网页的网页内容进行 切词以获得多个关键词, 由权重获取装置统计每个关键词相对于 该十个搜索结果网页的 TF-IDF值, 将所获得的 TF-IDF值作为各 个关键词的权重信息; 然后, 第一子确定装置根据 TF-IDF值对各 个关键词进行排序后选择排名前二十位的关键词, 并将该排名前 二十位的关键词及其各自对应的 TF-IDF值作为候选词 "马尔代夫" 的特征信息。
b ) 第二确定装置中的模型确定装置 (图未示)通过预定主题确 定模型, 根据所述一个或多个搜索结果网页中的各个网页的网页内 容, 来确定与所述一个或多个搜索结果网页对应的主题相关信息; 接着, 第二确定装置中的第二子确定装置(图未示)基于所确定的主 题相关信息来确定与所述候选词对应的特征信息。
其中, 所述预定主题确定模型用于对给定的文本信息通过预定 模型来执行数据挖掘等操作, 以获得与所述文本信息对应的主题相 关信息。 例如, 潜在狄利克雷分配模型 (LDA, Latent Dirichlet Allocation ) 、 概率潜在语义分析模型(PLSA , Probabilistic Latent Semantic Analysis ) 、 带标签的潜在狄利克雷分配模型 (Labeled LDA , Labeled Latent Dirichlet Allocation)模型等。
其中, 所述主题相关信息包括用于表征所述文本信息的一个或 多个主题的信息, 例如, 用于表征所述文本信息的主题的多个关键 词等。
优选地, 所述主题相关信息还包括用于表征该一个或多个主题 在所述文本中的权重的信息, 例如, 与用于表征所述文本信息的主 题的多个关键词相对应的关键词权重等。
其中, 本领域技术人员应可根据实际情况和需求确定所采用的 主题模型, 以及通过主题模型获得一个或多个主题相关信息的方 ^, 匕 f ϋ。
接着, 第一确定装置 3根据所述候选词的特征信息, 在多级分类 索引信息中确定与所述候选词对应的分类索引。 其中, 所述多级分类索引信息包括多个基于预定拓朴结构相互关 联的分类索引,其中,各个分类索引分别对应至少一个分类相关网页。
其中, 确定多级分类索引信息的方式将在后续参照图 3所示的实 施例中予以详述, 并以引用的方式包含于此, 在此不再赞述。
具体地, 第一确定装置 3获取所述候选词的特征信息与多级分类 索引信息中的各个分类索引的至少一个分类相关网页之间的相似度, 并基于相似度来确定与所述候选词对应的分类索引。
接着, 第一生成装置 4根据与所述分类索引信息对应的至少一个 分类相关网页, 来确定与所述候选词对应的词条信息。
具体地, 第一生成装置 4由与所述分类索引相对应的至少一个分 类相关网页中, 获取与所述候选词相关的网页内容, 以生成属于所述 分类索引的、 与所述候选词对应的词条信息。
其中, 第一生成装置 4由至少一个分类相关网页中获取与候选词 相关的内容信息的方式包括:
第一生成装置 4根据所述候选词及其特征信息, 由所述至少一个 分类相关网页中挖掘与所述候选词及其特征信息相对应的网页内容, 作为与该候选词对应的词条信息的内容信息。
继续对前述第一示例进行说明, 多级分类索引信息包括如下表 2 所示的基于预定的树状拓朴结构相关联的分类索引:
表 2
Figure imgf000021_0001
并且, 每个分类索引均对应多个分类相关网页, 第一确定装置 3 确定与候选词 "马尔代夫" 对应的分类索引为 "境内游" , 则第一生 成装置 4从与分类索引 "境内游"对应的多个分类相关网页中获取与 候选词 "马尔代夫"及其特征信息相关的网页内容,并将其作为与 "马 尔代夫 "这一候选词对应的词条信息的内容,以生成属于分类索引 "出 境游" 的、 与候选词 "马尔代夫" 对应的词条信息。
优选地, 当已存在属于所述分类索引的、 且与所述候选词对应的 词条信息时, 计算机设备由与所述分类索弓 I相对应的至少一个分类相 关网页中, 获取与所述候选词相关的内容信息, 以更新该候选词对应 的词条信息。
根据本发明的方案, 可自动由与候选词具有较高相似度的分类相 关中获取词条信息的内容, 从而极大的提高了词条信息的生成与更新 的效。 并且, 能够更加充分地挖掘并利用分类相关网页的内容。
图 6示意出了根据本发明的一个优选实施例的用于生成词条信息 的词条生成装置的结构示意图。根据本实施例的词条生成装置包括第 一获取装置 1、 第二获取装置 2、 第一确定装置 3、 第一生成装置 4、 第三获取装置 5、 第三确定装置 6以及判断装置 7。
其中, 第一获取装置 1、 第二获取装置 2、 第一确定装置 3 以及 第一生成装置 4已在参照图 5所示的实施例中予以详述, 并以引用的 方式包含于此, 不再赘述。
第三获取装置 5获取与所述候选词对应的一项或多项网络发布信 自、
其中, 所述网络发布信息包括用于在互联网中发布的、 具有一定 的各类信息。 优选地, 所述网络发布信息包括广告。
其中, 所述第三获取装置 5获取与所述候选词对应的一项或多项 网络发布信息的方式包括但不限于以下任一项:
1 )第三获取装置 5通过在第二预定搜索引擎中查询所述候选词, 以获取与所述候选词对应的一项或多项网络发布信息。
其中, 所述第二预定搜索引擎包括但不限于可基于候选词执行搜 索并获取一个或多个网络发布信息的搜索引擎。
优选地, 所述第二预定搜索引擎与前述参照图 5的实施例中所述 的第一预定搜索引擎为同一搜索引擎。
2 ) 第三获取装置 5通过预定的各个候选词与网络发布信息的对 应关系, 来获取与该候选词对应的一项或多项网络发布信息。
接着, 第三确定装置 6根据所获得的一项或多项网络发布信息来确 定所述候选词的重要度信息。
具体地, 所述第三确定装置 6根据所获得的一项或多项网络发布信 息来确定所述候选词的重要度信息的方式包括但不限于以下任一项:
1 ) 第三确定装置 6统计所述候选词相对于所述一项或多项网络发 布信息的权重信息。
例如, 第三确定装置 6统计所述候选词相对于其所对应的多项广告 中的 TF-IDF值并将其作为候选词的重要度信息。
2 ) 第三确定装置 6统计所述一项或多项网络发布信息数量, 并将 其作为所述候选词的重要度信息;
3 ) 第三确定装置 6获取所述一项或多项网络发布信息的被使用信 息, 并根据所获得的被使用信息来确定所述候选词的重要度信息。 其 中, 所述网络发布信息的被使用信息包括但不限于以下至少任一项: a ) 所述网络发布信息的 现次数;
b ) 所述网络发布信息的被点击次数等。
例如, 第三确定装置 6 统计候选词所对应的所有广告的被点击次 数, 并将其作为候选词的重要度信息; 又例如, 第三确定装置 6 统计 候选词所对应的广告的平均被点击次数, 以将其作为候选词的重要度 信息等。
接着, 判断装置 Ί判断所述候选词的重要度信息是否满足预定重要 度条件。
其中, 所述预定重要度条件包括预定重要度阈值;
具体地, 判断装置 7判断所述候选词的重要度信息是否满足预定阈 值。
接着, 根据本实施例的方案, 当所述候选词的重要度信息满足预定 重要度条件时, 第二获取装置 2获取所述候选词的特征信息。
作为本实施例的优选方案之一, 所述第三获取装置 5进一步包括第 一子获取装置 (图未示)和第二搜索装置 (图未示) , 所述第三确定装 置进一步包括第三子确定装置(图未示)和第四子确定装置(图未示)。 第一子获取装置对所述候选词进行切词以获取多个子候选词。
第二搜索装置通过第二预定搜索引擎, 基于各个子候选词执行搜索 以获取与各个子候选词对应的网络发布信息。
其中, 所述第二搜索装置通过第二预定搜索引擎, 基于各个子候选 词执行搜索以获取与各个子候选词对应的网络发布信息的方式与前述 第三获取装置 5 通过在第二预定搜索引擎中查询所述候选词, 以获取 与所述候选词对应的一项或多项网络发布信息的方式相同或相似, 故 不再赘述。
接着, 第三子确定装置基于各个子候选词对应的网络发布信息确定 该子候选词的子重要度信息。
其中, 第三子确定装置基于各个子候选词对应的网络发布信息确定 该子候选词的子重要度信息的方式与前述计算机设备根据所获得的一 项或多项网络发布信息来确定所述候选词的重要度信息的方式相同或 相似, 故不再赘述。
第四子确定装置基于各个子候选词的子重要度信息确定所述候选词 的重要度信息。
具体地, 所述第四子确定装置基于预定的统计规则, 确定各个子 候选词的子重要度信息。
优选地, 第四子确定装置基于预定的统计规则, 确定各个子候选 词的子重要度信息的方式包括但不限于以下任一种:
1 )第四子确定装置根据各个子候选词的子重要度信息, 确定平均 重要度信息, 并将其作为候选词的重要度信息。
2 )第四子确定装置获取各个子候选词相对于其所属的候选词的权 重值, 并基于各个子候选词的子重要度信息以及各个子候选词的权 重值, 来确定候选词的重要度信息。
例如, 第四子确定装置基于各个子候选词在其所属的候选词中出 现的次数来确定各个子候选词的权重值, 并基于各个子候选词的子 重要度信息以及各个子候选词的权重值, 来确定候选词的重要度信 根据本实施例的方案, 仅对满足预定重要度条件的候选词来生 成词条, 提高了词条生成效率。
图 7示意出了根据本发明的又一个优选实施例的用于生成词条信 息的词条生成装置的结构示意图。根据本实施例的词条生成装置包括 第一获取装置 1、第二获取装置 2、第一确定装置 3、第一生成装置 4、 导航获取装置 8以及第二生成装置 9。
其中, 第一获取装置 1、 第二获取装置 2、 第一确定装置 3 以及 第一生成装置 4已在参照图 5所示的实施例中予以详述, 并以引用的 方式包含于此, 不再赘述。
导航获取装置 8获取一个或多个网站的网页导航信息。
其中, 所述一个或多个网站可以为人工指定的具有一定相似度的 一个或多个网站, 也可以为通过对大量网站的网页内容执行聚类分析 后所确定的, 具有一定相似度的一个或多个网站。
其中, 所述网页导航信息包括但不限于基于网站中的网页栏目结 构, 为用户浏览网页提供提示的信息。
第二生成装置 9根据所获得的一个或多个网页导航信息, 来生成 多级分类索引信息, 其中, 所述多级分类索引中的各个分类索引按照 预定拓朴结构相互关联。
具体地,第二生成装置 9根据所获得的一个或多个网页导航信息, 来生成多级分类索引信息的方式包括但不限于以下任一项:
1 ) 第二生成装置 9直接将所获得的网页导航信息转换为多级分 类索引。
例如, 第二生成装置 9将网站的导航栏中的各个栏目作为分类索 引, 并依次保存各个栏目之间的所属关系, 以作为各个分类索引之间 的所属关系, 以生成多级分类索引。
2 )第二生成装置 9对多个网站的网页导航信息进行选择与合并, 并基于选择合并后的结果来生成词条索引信息。
例如, 第二生成装置 9将该多个网站的导航栏中共同包含的一个 或多个栏目作为分类索引, 并选择其中一个网站导航栏中的各个栏目 之间的所属关系, 作为所获得的各个分类索引之间的所属关系的参 考, 以生成多级分类索引。
作为本实施例的优选方案之一,根据本方案的词条生成装置还包 括第四获取装置 (图未示) 、 第一特征确定装置 (图未示) 。
第四获取装置基于与所述多级分类索引信息对应的所述一个或 多个网站的网页导航信息, 获取与该多级分类索引信息中的各个分类 索引分别对应的分类相关网页。
具体地, 第四获取装置基于与所述多级分类索引信息对应的所述 一个或多个网站的网页导航信息, 确定分别与各个分类索引相对应 的、 所述一个或多个网站的网页导航信息中的部分导航信息, 并获取 所述一个或多个网站中与该部分导航信息对应的至少一个站点网页, 作为与所述分类索引相对应的分类相关网页。
接着, 第一特征确定装置基于与所述各个分类索引相对应的分类 相关网页来确定与该各个分类索 ^ I分别对应的分类特征信息。
其中, 第一特征确定装置基于与所述各个分类索引相对应的分类 相关网页来确定与该各个分类索引分别对应的分类特征信息的方式 与前述参照图 5所示实施例中第二确定装置根据所述一个或多个搜索 结果网页, 来确定与所述候选词对应的特征信息的方式相同或相似, 此处不再赞述。
接着,根据本实施例的第一确定装置 3基于所述候选词的特征信 息以及各个分类索引的分类特征信息, 确定与所述候选词对应的分类 索引。
具体地, 第一确定装置 3将所述候选词的特征信息与各个分类索 引的分类特征信息进行比较, 并选择分类特征信息与候选词的特征信 息的相似度满足预定相似度条件的分类索引, 作为与所候选词对应的 分类索引。
其中, 所述预定相似度条件包括相似度满足预定相似度阈值。 作为本实施例的优选方案之一, , 所述预定拓朴结构包括多级的 拓朴结构, 其中相邻两级的分类索引之间为隶属关系, 其中, 所述第 一确定装置 3进一步包括比较获取装置(图未示)和第一分类确定装 置 (图未示) 。
优选地, 所述预定拓朴结构包括多级的树状结构, 相邻的两级的 分类索引之间为隶属关系。
比较获取装置将所述候选词的特征信息与所述各个分类索引的 分类特征信息相比较, 以获取其分类特征信息与所述候选词的特征信 息相似的分类索引。
具体地, 比较获取装置根据所述预定拓朴结构, 按照预定遍历顺 序, 将所述候选词的特征信息逐个与所述各个分类索引的分类特征信 息相比较, 以获取其分类特征信息与所述候选词的特征信息相似的分 类索引。
例如, 当预定拓朴结构为树状结构, 并且预定遍历顺序为随机遍 历时, 随机获取尚未被遍历的分类索引, 并将该分类索引的分类特征 信息与候选词的特征信息相比较, 以获取其分类特征信息与所述候选 词的特征信息相似的分类索引。
又例如, 当预定拓朴结构为树状结构, 并且预定遍历顺序为从叶 结点逐层向上遍历时, 先获取作为各个叶结点的分类索引, 将该层的 分类索引的分类特征信息与候选词的特征信息相比较, 当未能在叶结 点中获得与所述候选词的特征信息相似的分类索引时,再获取各个叶 结点上一层的结点的分类索引, 并将该层的分类索引的分类特征信息 与候选词的特征信息相比较, 依次逐层往上, 直至获得与所述候选词 的特征信息相似的分类索引。
当所获得的分类索引包含底层分类索引时, 第一分类确定装置将 该底层分类索引作为所述候选词对应的分类索引。
具体地, 第一分类确定装置判断所获得的分类索引是否为底层分 类索引, 并当所获得的分类索引包含底层分类索引时, 第一分类确定 装置将该底层分类索引作为所述候选词对应的分类索引。
优选地, 根据本方案的词条生成装置中, 所述第一确定装置 3还 包括第三生成装置 (图未示) 和第二分类确定装置 (图未示) 。
当所获得的分类索引不包含底层索引节点时, 第三生成装置基于 其中最低级别的分类索引所对应的一个或多个分类相关网页以及所 述候选词, 来生成位于该最低级别的分类索引的下级分类索引。
具体地, 第三生成装置基于其中最低级别的分类索引所对应的一 个或多个分类相关网页以及所述候选词, 来生成位于该最低级别的分 类索引的下级分类索引的方式包括但不限于以下任一种:
1 ) 第三生成装置基于候选词生成属于由前述第一分类确定装置 所获得的分类索引的下级分类索引的名称, 并基于候选词所对应的搜 索结果页面以及所获得的分类索引所对应的分类相关页面, 确定与该 下级分类索引相对应的分类相关网页。
2 ) 第三生成装置基于前述第一分类确定装置所获得的分类索引 对应的一个或多个站点网页, 在该一个或多个站点网页中查询并获取 与候选词相关的至少一个网页, 并确定与所该网页对应的中心词, 以 将其作为前述第一分类确定装置获得的分类索引的下级分类索引的 名称, 并将该至少一个网页作为与该下级分类索引对应的分类相关网 页。
接着, 第二分类确定装置将所生成的底层分类索引作为与所述候 选词对应的分类索引。
根据本实施例的方案, 通过获取一个或多个网站的网站导航信息 来建立多级分类索引, 从而使得词条的分类索引体系与实际使用中的 体系相近, 有利于更加全面的挖掘专业网站的内容信息, 并且由于同 时还可利用这些网站的网页内容作为分类索引的分类相关网页,故能 够为候选词生成能够有更加系统、 完整的词条信息。
图 8示意出了根据本发明的又一优选实施例的用于生成词条信息 的词条生成装置的结构示意图。根据本实施例的词条生成装置包括第 一获取装置 1、 第二获取装置 2、 第一确定装置 3、 第一生成装置 4、 第一网页获取装置 10、 第二特征确定装置 11、 第三分类确定装置 12 以及提供装置 13。 其中, 第一获取装置 1、 第二获取装置 2、 第一确定装置 3 以及 第一生成装置 4已在参照图 5所示的实施例中予以详述, 并以引用的 方式包含于此, 不再赘述。
第一网页获取装置 10获取候选网站的一个或多个网页。
其中, 第一网页获取装置 10确定候选网站的方式包括但不限于以 下任一种:
1 ) 第一网页获取装置 10获取人工指定的网站作为候选网站;
2 )第一网页获取装置 10将抓取到的网站页面与多级分类索引信息 中的各个分类索引所对应的网页进行比较, 以获得站点网页与所述各 个分类索弓 I所对应的网页相似的网站。
接着, 第二特征确定装置 11 根据所述候选网站的一个或多个网 页, 确定该候选网站的站点特征信息。
其中, 第二特征确定装置 11 根据所述候选网站的一个或多个网 页, 确定该候选网站的站点特征信息的方式与前述参照图 5所示实施例 中第二确定装置根据所述一个或多个搜索结果网页, 来确定与所述候 选词对应的特征信息的方式相同或相似, 在此不再赞述。
接着, 第三分类确定装置 12将所述候选网站的站点特征信息与各 个分类索引的分类特征信息进行比较, 以确定与该候选网站对应的一 个或多个分类索引。
其中, 第三分类确定装置 12将所述候选网站的站点特征信息与各 个分类索引的分类特征信息进行比较, 以确定与该候选网站对应的一 个或多个分类索引的方式与前述参照图 7所示实施例中比较确定装置将 所述候选词的特征信息与所述各个分类索引的分类特征信息相比较, 以获取其分类特征信息与所述候选词的特征信息相似的分类索引的方 式相同或相似, 在此不再赘述。
接着, 提供装置 13 向该候选网站对应的候选用户提供该一个或多 个分类索弓 I分别对应的一个或多个候选词。
作为本实施力的优选方案, 才艮据本实施例的词条生成装置还包括第 二网页获取装置 (图未示) 、 第一更新装置 (图未示) 以及第一更新装 置 (图未示) 。
第二网页获取装置根据与所述候选网站对应的一个或多个分类索 引, 获取所述候选网站中与该一个或多个分类索引分别对应的一个或 多个候选网页。
其中, 所述第二网页获取装置根据与所述候选网站对应的一个或多 个分类索引, 获取所述候选网站中与该一个或多个分类索引分别对应 的一个或多个候选网页的方式包括但不限于以下任一种:
1 )第二网页获取装置获取该一个或多个分类索引的分类相关网页, 将所获得的分类相关网页与所述候选网站的站点网页进行比较, 以获 得与所述分类相关网页相似的一个或多个站点网页, 并将其作为与该 分类相关网页所对应的分类索引的候选网页。
2 )第二网页获取装置根据该一个或多个分类索引的分类特征信息, 由候选网站中获取分别与该一个或多个分类索引的分类特征信息相似 的一个或多个候选网页。
接着, 第一更新装置基于与各个分类索引对应的、 所述候选网站中 的一个或多个候选网页, 确定或更新与该各个分类索弓 I对应的分类相 关网页。
具体地, 第一更新装置将所确定的候选网页作为与分类索引对应的 分类相关网页添加至与各个分类索引对应的分类相关网页库中。
第一更新装置基于所述更新后的与各个分类索引对应的分类相关网 页, 更新各个分类索引所对应的候选词的词条信息。
具体地, 第一更新装置对属于个各个分类索引的一个或多个候选 词, 分别采用更新后的该分类索引的分类相关网页来更新各个候选 词的词条内容。
其中, 第一更新装置采用更新后的该分类索引的分类相关网页来 更新各个候选词的词条内容的方式与前述参照图 5所示实施例中第一 生成装置根据与所述分类索引信息对应的至少一个分类相关网页, 来 确定与所述候选词对应的词条信息的方式相同或相似, 此处不再赞 述。 才艮据本实施例的方案, 通过采用候选网站的内容来自动更新词条信 息, 使得词条内容能够尽快得到更新, 并且提高了更新效率。
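The software program of the present invention can be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) can be stored in a computer readable recording medium, for example a RAM memory, a magnetic or optical drive, a floppy disk or a similar device. In addition, some of the steps or functions of the present invention may be implemented in hardware, for example as a circuit that cooperates with a processor to perform the individual functions or steps.
In addition, a part of the present invention can be applied as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted by a data stream in a broadcast or other signal bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, an embodiment according to the present invention includes a device that includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions according to the foregoing embodiments of the present invention.
It is apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments above, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-limiting. The scope of the present invention is defined by the appended claims rather than by the description above, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the present invention. No reference numeral in a claim should be regarded as limiting the claim concerned. Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.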

Claims

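Claims
1. A method for generating entry information, wherein the method comprises the following steps:
a. acquiring a candidate word;
b. performing a search based on the candidate word to acquire feature information of the candidate word;
c. determining, according to the feature information of the candidate word, a classification index corresponding to the candidate word in multi-level classification index information, wherein the classification index corresponds to at least one classification related webpage;
d. determining entry information corresponding to the candidate word according to the at least one classification related webpage corresponding to the classification index information.
2. The method according to claim 1, wherein step b comprises the following steps:
b1. performing a search based on the candidate word through a first predetermined search engine, to acquire one or more search result webpages corresponding to the candidate word;
b2. determining the feature information corresponding to the candidate word according to the one or more search result webpages.
3. The method according to claim 2, wherein step b2 comprises the following steps:
- acquiring at least one keyword contained in the one or more search result webpages;
- acquiring weight information of each of the at least one keyword;
- determining the feature information corresponding to the candidate word based on the obtained keywords and their corresponding weight information.
4. The method according to claim 2, wherein step b2 comprises the following steps:
- determining, through a predetermined topic determination model and according to the webpage content of each of the one or more search result webpages, topic related information corresponding to the one or more search result webpages;
- determining the feature information corresponding to the candidate word based on the determined topic related information.
5. The method according to any one of claims 1 to 3, wherein the method further comprises the following steps:
x. acquiring one or more items of network publishing information corresponding to the candidate word;
y. determining importance information of the candidate word according to the obtained one or more items of network publishing information;
wherein the method further comprises the following step:
- judging whether the importance information of the candidate word satisfies a predetermined importance condition;
wherein step b comprises the following step:
- acquiring the feature information of the candidate word when the importance information of the candidate word satisfies the predetermined importance condition.
6. The method according to claim 5, wherein step x comprises the following steps:
- performing word segmentation on the candidate word to obtain a plurality of sub-candidate words;
- performing a search based on each sub-candidate word through a second predetermined search engine, to acquire the network publishing information corresponding to each sub-candidate word;
wherein step y comprises the following steps:
- determining sub-importance information of each sub-candidate word based on the network publishing information corresponding to that sub-candidate word;
- determining the importance information of the candidate word based on the sub-importance information of the sub-candidate words.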
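7. The method according to any one of claims 1 to 6, wherein the method further comprises the following steps:
- acquiring webpage navigation information of one or more websites;
- generating multi-level classification index information according to the obtained webpage navigation information, wherein the classification indexes in the multi-level classification index are associated with one another according to a predetermined topology.
8. The method according to claim 7, wherein the method comprises the following steps:
- acquiring, based on the webpage navigation information of the one or more websites corresponding to the multi-level classification index information, the webpages respectively corresponding to the classification indexes in the multi-level classification index information;
- determining the classification feature information respectively corresponding to the classification indexes based on the webpages corresponding to the classification indexes;
wherein step c comprises the following step:
- determining the classification index corresponding to the candidate word based on the feature information of the candidate word and the classification feature information of each classification index.
9. The method according to claim 8, wherein the predetermined topology includes a multi-level topology in which the classification indexes of two adjacent levels are in a membership relation, and wherein step c comprises the following steps:
- comparing the feature information of the candidate word with the classification feature information of each classification index, to obtain the classification indexes whose classification feature information is similar to the feature information of the candidate word;
- when the obtained classification indexes include a bottom-level classification index, using that bottom-level classification index as the classification index corresponding to the candidate word.
10. The method according to claim 9, wherein step c further comprises the following steps:
- when the obtained classification indexes do not include a bottom-level index node, generating a lower-level classification index under the lowest-level classification index among them, based on the candidate word and the one or more classification related webpages corresponding to that lowest-level classification index;
- using the generated bottom-level classification index as the classification index corresponding to the candidate word.
11. The method according to any one of claims 1 to 9, wherein the method further comprises the following steps:
- acquiring one or more webpages of a candidate website;
- determining site feature information of the candidate website according to the one or more webpages of the candidate website;
- comparing the site feature information of the candidate website with the classification feature information of each classification index, to determine one or more classification indexes corresponding to the candidate website;
- providing the candidate user corresponding to the candidate website with one or more candidate words respectively corresponding to the one or more classification indexes.
12. The method according to claim 11, wherein the method further comprises the following steps:
- acquiring, according to the one or more classification indexes corresponding to the candidate website, one or more candidate webpages in the candidate website that respectively correspond to the one or more classification indexes;
- determining or updating the classification related webpages corresponding to each classification index based on the one or more candidate webpages in the candidate website that correspond to that classification index;
- updating the entry information of the candidate words corresponding to each classification index based on the updated classification related webpages corresponding to that classification index.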
13. 一种用于生成词条信息的词条生成装置, 其中, 所述词条生成 装置包括:
第一获取装置, 用于获取候选词;
第二获取装置, 用于基于所述候选词进行搜索, 以获取所述候选词 的特征信息;
第一确定装置, 用于才艮据所述候选词的特征信息, 在多级分类索引 信息中确定与所述候选词对应的分类索引; 其中, 所述分类索引对应 至少一个分类相关网页;
第一生成装置, 用于才艮据与所述分类索引信息对应的至少一个分类 相关网页, 来确定与所述候选词对应的词条信息。
14. 根据权利要求 13所述的词条生成装置, 其中, 所述第二获取装 置包括:
第一搜索装置, 用于通过第一预定搜索引擎, 基于所述候选词执行 搜索, 以获取与所述候选词对应的一个或多个搜索结果网页;
第二确定装置, 用于才艮据所述一个或多个搜索结果网页, 来确定与 所述候选词对应的特征信息。
15. 根据权利要求 14所述的词条生成装置, 其中, 所述第二确定装 置包括:
关键词获取装置, 用于获取所述一个或多个搜索结果网页中所包含 的至少一个关键词;
权重获取装置, 用于获取所述至少一个关键词中的各个关键词的权 重信息;
第一子确定装置, 用于基于所获得的各个关键词及其相应的权重信 息, 来确定与所述候选词对应的特征信息。
16. 根据权利要求 14所述的词条生成装置, 其中, 所述第二确定装 置包括以下步骤:
模型确定装置, 用于通过预定主题确定模型, 根据所述一个或多个 搜索结果网页中的各个网页的网页内容, 来确定与所述一个或多个搜 索结果网页对应的主题相关信息;
第二子确定装置, 用于基于所确定的主题相关信息来确定与所述候 选词对应的特征信息。
17. 根据权利要求 13至 16中任一项所述的词条生成装置, 其中, 所述词条生成装置还包括:
第三获取装置, 用于获取与所述候选词对应的一项或多项网络发布 第三确定装置, 用于根据所获得的一项或多项网络发布信息来确定 所述候选词的重要度信息;
判断装置, 用于判断所述候选词的重要度信息是否满足预定重要度 条件;
其中, 所述第二获取装置用于:
- 当所述候选词的重要度信息满足预定重要度条件时, 获取所述候 选词的特征信息。
18. 根据权利要求 17所述的词条生成装置, 其中, 所述第三获取装 置包括:
第一子获取装置, 用于对所述候选词进行切词以获取多个子候选 词;
第二搜索装置, 用于通过第二预定搜索引擎, 基于各个子候选词执 行搜索以获取与各个子候选词对应的网络发布信息;
其中, 所述第三确定装置包括:
第三子确定装置, 用于基于各个子候选词对应的网络发布信息确定 该子候选词的子重要度信息;
第四子确定装置, 用于基于各个子候选词的子重要度信息确定所述 候选词的重要度信息。
19. 根据权利要求 13至 18中任一项所述的词条生成装置, 其中, 所述词条生成装置还包括:
导航获取装置, 用于获取一个或多个网站的网页导航信息; 第二生成装置, 用于根据所获得的一个或多个网页导航信息, 来生 成多级分类索引信息, 其中, 所述多级分类索引中的各个分类索引按 照预定拓朴结构相互关联。
20. 根据权利要求 19所述的词条生成装置, 其中, 所述词条生成装 置包括以下步骤:
第四获取装置, 用于基于与所述多级分类索弓 I信息对应的所述一个 或多个网站的网页导航信息, 获取与该多级分类索引信息中的各个分 类索引分别对应的网页;
第一特征确定装置, 用于基于与所述各个分类索引相对应的网页来 确定与该各个分类索 ^ I分别对应的分类特征信息;
其中, 所述第一确定装置用于:
-基于所述候选词的特征信息以及各个分类索引的分类特征信息, 确定与所述候选词对应的分类索引。
21. 根据权利要求 20所述的词条生成装置, 其中, 所述预定拓朴结 构包括多级的拓朴结构, 其中相邻两级的分类索引之间为隶属关系, 其中, 所述第一确定装置包括:
比较获取装置, 用于将所述候选词的特征信息与所述各个分类索弓 I 的分类特征信息相比较, 以获取其分类特征信息与所述候选词的特征 信息相似的分类索引;
第一分类确定装置, 用于当所获得的分类索引包含底层分类索引 时, 将该底层分类索引作为所述候选词对应的分类索引。
22. 根据权利要求 21所述的词条生成装置, 其中, 所述第一确定装 置还包括:
第三生成装置, 用于当所获得的分类索引不包含底层索引节点时, 基于其中最低级别的分类索引所对应的一个或多个分类相关网页以及 所述候选词, 来生成位于该最低级别的分类索引的下级分类索引; 第二分类确定装置, 用于将所生成的底层分类索引作为与所述候选 词对应的分类索引。
23. 根据权利要求 13至权利要求 21所述的词条生成装置, 其中, 所述词条生成装置还包括:
第一网页获取装置, 用于获取候选网站的一个或多个网页; 第二特征确定装置, 用于才艮据所述候选网站的一个或多个网页, 确 定该候选网站的站点特征信息;
第三分类确定装置, 用于将所述候选网站的站点特征信息与各个分 类索引的分类特征信息进行比较, 以确定与该候选网站对应的一个或 多个分类索引;
提供装置, 用于向该候选网站对应的候选用户提供该一个或多个分 类索引分别对应的一个或多个候选词。
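24. The entry generating apparatus according to claim 23, wherein the entry generating apparatus further comprises:
a second web page obtaining device, configured to obtain, according to the one or more classification indexes corresponding to the candidate website, one or more candidate web pages in the candidate website that respectively correspond to the one or more classification indexes;
a first updating device, configured to determine, based on the one or more candidate web pages in the candidate website corresponding to each classification index, the classification-related web pages corresponding to that classification index;
a first updating device, configured to update, based on the updated classification-related web pages corresponding to each classification index, the entry information of the candidate words corresponding to each classification index.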
PCT/CN2014/079220 2013-06-28 2014-06-05 Method and apparatus for generating entry information WO2014206186A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310268427.5A CN104252487B (zh) 2013-06-28 2013-06-28 Method and apparatus for generating entry information
CN201310268427.5 2013-06-28

Publications (1)

Publication Number Publication Date
WO2014206186A1 (zh) 2014-12-31

Family

ID=52141011

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/079220 WO2014206186A1 (zh) 2013-06-28 2014-06-05 Method and apparatus for generating entry information

Country Status (2)

Country Link
CN (1) CN104252487B (zh)
WO (1) WO2014206186A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
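CN106776652B (zh) * 2015-11-24 2020-09-25 北京国双科技有限公司 Data processing method and device
CN108268552B (zh) * 2016-12-30 2020-08-11 北京国双科技有限公司 Method and device for processing website information
CN109271615B (zh) * 2017-07-13 2023-10-31 北京搜狗科技发展有限公司 Entry processing method, device and machine-readable medium
CN107506473B (zh) * 2017-09-05 2020-10-27 郑州升达经贸管理学院 Big data retrieval method based on cloud computing
CN113282745B (zh) * 2020-02-20 2023-04-18 清华大学 Method and device for automatically generating event encyclopedia documents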

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
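CN101251854A (zh) * 2008-03-19 2008-08-27 深圳先进技术研究院 Method for generating search entries, and data retrieval method and system
US20090094020A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Recommending Terms To Specify Ontology Space
CN101566995A (zh) * 2008-04-25 2009-10-28 北京搜狗科技发展有限公司 Method and system for integrated publishing of Internet information
CN101957831A (zh) * 2009-07-17 2011-01-26 刘二中 Method for inputting and processing feature words of file content
CN101986310A (zh) * 2010-11-16 2011-03-16 无敌科技(西安)有限公司 Method and device for updating a dictionary of Internet terms
WO2012000335A1 (zh) * 2010-06-30 2012-01-05 百度在线网络技术(北京)有限公司 Input method and device combined with an application interface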

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7051023B2 (en) * 2003-04-04 2006-05-23 Yahoo! Inc. Systems and methods for generating concept units from search queries

Also Published As

Publication number Publication date
CN104252487B (zh) 2019-05-03
CN104252487A (zh) 2014-12-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14817533

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14817533

Country of ref document: EP

Kind code of ref document: A1