WO2017215244A1 - Method and device for providing relevant words - Google Patents

Method and device for providing relevant words

Info

Publication number
WO2017215244A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
keyword
word set
related word
words
Prior art date
Application number
PCT/CN2016/113175
Other languages
French (fr)
Chinese (zh)
Inventor
李贤�
Original Assignee
广州视源电子科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司
Publication of WO2017215244A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis

Definitions

  • The present invention relates to the field of computer technologies, and in particular to a method and apparatus for providing related words.
  • Shopping websites and search engine service websites provide keyword search: the user enters a keyword for the product or technology to be searched, and the server searches on that keyword and returns the corresponding results to the user.
  • To provide accurate search results, the server generally expands the keyword: it finds the related words corresponding to the keyword entered by the user and offers them to the user, so that when a keyword search does not return satisfactory results, the search can be run on the related words instead.
  • Existing related word expansion relies on existing dictionaries, such as WordNet and 《同义词林》 (a Chinese synonym thesaurus). The related words obtained in this way are quite limited in number and may not keep up with changes in the language, so they cannot meet the timeliness requirements for related words.
  • Embodiments of the present invention provide a method and apparatus for providing related words that can provide more, and more accurate, related words.
  • taking a keyword input by the user as an input word, acquiring a lower related word set of the keyword from the entry database, and determining the relevance of each lower related word in the lower related word set to the keyword;
  • taking the union of the keyword's lower related word set and superordinate related word set as the keyword's output related word set and, according to the relevance of each output related word, selecting from the output related word set the related words provided to the user.
  • obtaining the superordinate related word set of the keyword from the entry database according to the keyword's lower related word set, and determining the relevance of each superordinate related word in that set to the keyword, specifically comprises:
  • for each lower related word in the lower related word set, the input word is updated to that lower related word, and the lower related word set of the updated input word is obtained from the entry database;
  • when the total number of lower related word sets is greater than a preset threshold, the lower related word sets that contain the keyword are selected, and the input word corresponding to each such lower related word set is taken as a superordinate related word, yielding the superordinate related word set of the keyword;
  • wherein the relevance between the keyword, as a member of a lower related word set that contains it, and the input word corresponding to that set is used as the relevance of that input word, as a superordinate related word, to the keyword;
  • the manner of obtaining the lower related word set from the entry database includes:
  • for each to-be-tested related word, entries containing that word are obtained from the entry database and subjected to word segmentation and screening, yielding the control word set of that to-be-tested related word;
  • when the absolute value (that is, the number of elements) of the intersection of a to-be-tested related word's control word set and the to-be-tested related word set is greater than a screening threshold, the to-be-tested related word is determined to be a lower related word of the input word, yielding the lower related word set; the absolute value is used as the relevance of the lower related word to the keyword.
  • obtaining entries containing the input word from the entry database and performing word segmentation and screening on them to obtain the to-be-tested related word set specifically includes:
  • obtaining the control word set of a to-be-tested related word includes:
  • the words in the second word set that belong to the core words in the user dictionary are extracted as control words, yielding the control word set.
  • the obtaining method further includes:
  • the screening threshold is subtracted from the relevance of each superordinate related word in the keyword's superordinate related word set to the keyword, completing the normalization of the relevance.
  • An embodiment of the present invention further provides an apparatus for providing related words, including:
  • a lower related word set module configured to use a keyword input by a user as an input word, obtain a lower related word set of the keyword from the entry database, and determine the relevance of each lower related word in the lower related word set to the keyword;
  • a superordinate related word set module configured to obtain a superordinate related word set of the keyword from the entry database according to the keyword's lower related word set, and determine the relevance of each superordinate related word in the superordinate related word set to the keyword;
  • a related word set module configured to use the union of the keyword's lower related word set and superordinate related word set as the keyword's output related word set and, according to the relevance of each output related word in the output related word set, select from it the related words provided to the user.
  • the superordinate related word set module specifically includes a lower word set obtaining unit, a threshold determining unit, and an upper word set obtaining unit, wherein
  • the lower word set obtaining unit is configured to, for each lower related word in the lower related word set, update the input word to that lower related word and obtain the lower related word set of the updated input word from the entry database;
  • the threshold determining unit is configured to determine whether the total number of lower related word sets is greater than a preset threshold;
  • the upper word set obtaining unit is configured to, when the total number of lower related word sets is determined to be greater than the preset threshold, select the lower related word sets that contain the keyword and take the input word corresponding to each such set as a superordinate related word, obtaining the superordinate related word set of the keyword; wherein the relevance between the keyword, as a member of that lower related word set, and the input word corresponding to the set is used as the relevance of that input word, as a superordinate related word, to the keyword;
  • the lower word set obtaining unit is further configured to, when the total number of lower related word sets is determined to be less than the preset threshold, continue with the following operation: for each lower related word in the lower related word sets of the updated input words, update the input word again to that lower related word and obtain the lower related word set of the newly updated input word from the entry database, until the total number of lower related word sets is greater than the preset threshold.
  • the lower related word set module and the lower word set obtaining unit each further comprise a unit for obtaining a lower related word set from the entry database, which specifically includes:
  • a to-be-tested related word set unit configured to obtain, according to the input word, entries containing the input word from the entry database, and to perform word segmentation and screening on those entries, obtaining a to-be-tested related word set;
  • a control word set unit configured to, for each to-be-tested related word in the to-be-tested related word set, obtain entries containing that word from the entry database and perform word segmentation and screening on them, obtaining the control word set of that to-be-tested related word;
  • a judgment obtaining unit configured to determine, when the absolute value (that is, the number of elements) of the intersection of a to-be-tested related word's control word set and the to-be-tested related word set is greater than a screening threshold, that the to-be-tested related word is a lower related word of the input word, obtaining the lower related word set; wherein the absolute value is used as the relevance of the lower related word to the keyword.
  • the to-be-tested related word set unit specifically includes:
  • a first entry subunit configured to obtain from the entry database, according to the input word, the entries that contain the input word and rank before the Mth position;
  • a first adjustment subunit configured to adjust the format of the obtained entries according to a standard entry format;
  • a first calling subunit configured to call the word segmentation tool;
  • a first word segmentation subunit configured to segment the format-adjusted entries with the word segmentation tool, obtaining a first word set;
  • a first extraction subunit configured to extract from the first word set the words that belong to the core words in the user dictionary as to-be-tested related words, obtaining the to-be-tested related word set;
  • wherein the user dictionary is provided by the word segmentation tool;
  • control word set unit specifically includes:
  • a second term sub-unit configured to obtain, according to the to-be-tested related words, an entry that includes the to-be-tested related words and is sorted before the Mth position;
  • a second adjustment subunit configured to perform format adjustment on the term that includes the to-be-recognized related word and is sorted before the Mth position according to the standard entry format
  • a second calling subunit configured to invoke the word segmentation tool
  • a second word segment sub-unit configured to perform word segmentation on the format-adjusted term containing the to-be-tested related words and sorted before the Mth position by using the word segmentation tool to obtain a second word set;
  • a second extraction subunit configured to extract from the second word set the words that belong to the core words in the user dictionary as control words, obtaining the control word set.
  • the apparatus for providing related words further includes a normalization module:
  • the normalization module is configured to subtract the screening threshold from the relevance of each lower related word in the keyword's lower related word set to the keyword, and to subtract the screening threshold from the relevance of each superordinate related word in the keyword's superordinate related word set to the keyword, completing the normalization of the relevance.
  • The method and device for providing related words provided by the embodiments of the present invention obtain the lower related word set of a keyword from the entry database using the keyword provided by the user, then derive the keyword's superordinate related word set from that lower related word set, and finally take the union of the two sets as the keyword's output related word set, which expands a large number of related words for the user to choose from.
  • Further, determining the relevance of each related word describes accurately how strongly it is related to the keyword, so that related words can subsequently be selected by relevance and provided to the user accurately.
  • FIG. 1 is a schematic flow chart of an embodiment of a method for providing related words provided by the present invention
  • FIG. 2 is a schematic flow chart of an embodiment of step S2 of the method for providing related words provided in FIG. 1;
  • FIG. 3 is a schematic flow chart of an embodiment of step S3 of the method for providing related words provided in FIG. 1;
  • FIG. 4 is a schematic flow chart of another embodiment of a method for providing related words provided by the present invention.
  • FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for providing related words according to the present invention.
  • FIG. 6 is a schematic structural diagram of an embodiment of a superordinate related word set module of a device for providing related words according to the present invention
  • FIG. 7 is a schematic structural diagram of an embodiment of a unit for acquiring a lower related word set of the apparatus for providing related words provided by the present invention.
  • FIG. 8 is a schematic structural diagram of an embodiment of a to-be-relevant related word set unit of a device for providing related words according to the present invention.
  • FIG. 9 is a schematic structural diagram of an embodiment of a control word set unit of the apparatus for providing related words provided by the present invention.
  • Referring to FIG. 1 to FIG. 3: FIG. 1 is a schematic flowchart of an embodiment of the method for providing related words provided by the present invention, FIG. 2 is a schematic flowchart of an embodiment of step S2 of the method provided in FIG. 1, and FIG. 3 is a schematic flowchart of an embodiment of step S3 of the method provided in FIG. 1.
  • The entry database is a paper database, for example China Knowledge Network.
  • Taking the acquisition of related words for the keyword Java as an example, the method for providing related words in this embodiment is described in detail; the method includes the following steps:
  • Step S1: using the keyword Java input by the user as the input word, obtaining the lower related word set of the keyword Java from the entry database, and determining the relevance of each lower related word in the lower related word set to the keyword.
  • Step S1 includes steps S11 to S13, as follows:
  • Using a search engine to obtain, according to the input word Java, the entries that contain the input word Java and rank before the Mth position, for example taking the abstracts on the first 50 pages of results as entries, or taking the first 500 abstracts returned when searching for the keyword Java in the wiki;
  • Formatting the entries according to the standard entry format, for example converting lowercase letters in the entries to uppercase, deleting extra spaces, unifying punctuation, or unifying full-width and half-width characters into a single form.
  • The word segmentation tool is called; preferably it is the jieba word segmentation tool, but other word segmentation tools may also be used.
  • Extracting, according to a keyword extraction algorithm, the words related to the input word from the first word set as to-be-tested related words $a_1, \ldots, a_n$, obtaining the to-be-tested related word set $A = \{a_1, \ldots, a_n\}$.
  • Core words can be extracted from the first word set as to-be-tested related words either with the word segmentation tool or with a dictionary added through the device that provides related words.
  • The entries containing each to-be-tested related word are segmented and screened, obtaining the control word set of that to-be-tested related word.
  • Step S22 follows the same process as step S21, except that the input word in step S21 is replaced by each to-be-tested related word $a_1, \ldots, a_n$; the to-be-tested related word set $B_{a_i} = \{b_{i1}, \ldots, b_{in}\}$ obtained for $a_i$ serves as the control word set of the to-be-tested related word $a_i$, so the details are not repeated here.
  • the noise words can be filtered out, and the efficiency of acquiring the lower related words is improved.
  • In step S31, the lower related word set is obtained in the same manner as in step S2 of the above embodiment, and the details are not repeated here.
  • The superordinate related word set of Java is the set of input words whose lower related word sets each contain Java.
  • In this way, related words can be provided to the user from multiple dimensions.
  • The related words provided to the user are selected from the output related word set.
  • the obtaining method further comprises normalizing the relevance:
  • the screening threshold is subtracted from the relevance of each superordinate related word in the keyword's superordinate related word set to the keyword, completing the normalization of the relevance.
  • The purpose of normalization is to make the relevance of each output related word to the keyword start from 0: the higher the value, the more strongly the related word is related to the keyword, which makes it easier to select the related words provided to the user from the output related word set in step S4.
  • By accepting as lower related words only those to-be-tested related words that pass verification, the method for providing related words filters out the influence of noise words and improves the quality of the related words obtained; in other words, it ensures the accuracy of the related words provided to the user.
  • While lower related word sets continue to be obtained, the superordinate related words of the keyword are obtained in reverse, which greatly expands the number of related words provided to the user while maintaining the quality of the superordinate related words.
  • Referring to FIG. 4, which is a schematic flowchart of another embodiment of the method for providing related words according to the present invention.
  • In this embodiment, a paper database and a Wikipedia database are each used as the entry database, and a first output related word set and a second output related word set are obtained from them respectively; the union of the first and second output related word sets is then taken as the final output related word set of the keyword.
  • The second output related word set is obtained from the Wikipedia database in the same manner as the output related word set is obtained from the paper database in the previous embodiment.
  • Two different entry databases, a paper database and a Wikipedia database, are used for mining related words.
  • The expansion of related words is therefore well targeted, and it avoids the related words provided to the user being too one-sided because of the corpus used.
  • FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for providing related words provided by the present invention, which can implement the entire process of the foregoing two embodiments, and the apparatus for providing related words includes:
  • the lower related word set module 10 is configured to use a keyword input by the user as an input word, obtain the lower related word set of the keyword from the entry database, and determine the relevance of each lower related word in the lower related word set to the keyword;
  • the superordinate related word set module 20 is configured to obtain the superordinate related word set of the keyword from the entry database according to the keyword's lower related word set, and determine the relevance of each superordinate related word in the superordinate related word set to the keyword;
  • the related word set module 30 is configured to use the union of the keyword's lower related word set and superordinate related word set as the keyword's output related word set and, according to the relevance of each output related word in the output related word set, select from it the related words provided to the user.
  • FIG. 6 is a schematic structural diagram of an embodiment of the superordinate related word set module of the apparatus for providing related words provided by the present invention; the superordinate related word set module 30 specifically includes a lower word set obtaining unit 31, a threshold determining unit 32, and an upper word set obtaining unit 33, wherein
  • the lower word set obtaining unit 31 is configured to, for each lower related word in the lower related word set, update the input word to that lower related word and obtain the lower related word set of the updated input word from the entry database;
  • the threshold determining unit 32 is configured to determine whether the total number of lower related word sets is greater than a preset threshold;
  • the upper word set obtaining unit 33 is configured to, when the total number of lower related word sets is determined to be greater than the preset threshold, select the lower related word sets that contain the keyword and take the input word corresponding to each such set as a superordinate related word, obtaining the superordinate related word set of the keyword; wherein the relevance between the keyword, as a member of that lower related word set, and the input word corresponding to the set is used as the relevance of that input word, as a superordinate related word, to the keyword;
  • the lower word set obtaining unit 31 is further configured to, when the total number of lower related word sets is determined to be less than the preset threshold, continue with the following operation: for each lower related word in the lower related word sets of the updated input words, update the input word again to that lower related word and obtain the lower related word set of the newly updated input word from the entry database, until the total number of lower related word sets is greater than the preset threshold.
  • FIG. 7 is a schematic structural diagram of an embodiment of the unit for obtaining a lower related word set of the device for providing related words, which specifically comprises:
  • the to-be-tested related word set unit 1 is configured to obtain, according to the input word, entries containing the input word from the entry database, and to perform word segmentation and screening on those entries, obtaining a to-be-tested related word set;
  • the control word set unit 2 is configured to, for each to-be-tested related word in the to-be-tested related word set, obtain entries containing that word from the entry database and perform word segmentation and screening on them, obtaining the control word set of that to-be-tested related word;
  • the judgment obtaining unit 3 is configured to determine, when the absolute value (that is, the number of elements) of the intersection of a to-be-tested related word's control word set and the to-be-tested related word set is greater than a screening threshold, that the to-be-tested related word is a lower related word of the input word, obtaining the lower related word set; wherein the absolute value is used as the relevance of the lower related word to the keyword.
  • FIG. 8 is a schematic structural diagram of an embodiment of a to-be-recognized related word set unit of the apparatus for providing related words according to the present invention.
  • a first lexical sub-unit 11 configured to obtain, from the vocabulary database, an entry that includes the input word and is sorted before the Mth position according to the input word;
  • the first adjusting sub-unit 12 is configured to perform format adjustment on the obtained entry according to the standard entry format
  • a first word segment sub-unit 14 configured to perform word segmentation on the format-adjusted term by using the word segmentation tool to obtain a first word set
  • a first extracting sub-unit 15 configured to extract from the first word set the words that belong to the core words in the user dictionary as to-be-tested related words, obtaining the to-be-tested related word set; wherein the user dictionary is provided by the word segmentation tool;
  • FIG. 9 is a schematic structural diagram of an embodiment of a control word set unit of the apparatus for providing related words according to the present invention.
  • the control word set unit 2 specifically includes:
  • a second lexical sub-unit 21 configured to obtain, according to the to-be-tested related words, an entry that includes the to-be-tested related words and is sorted before the Mth position;
  • a second adjustment sub-unit 22 configured to perform format adjustment on the term that includes the to-be-recognized related word and is sorted before the Mth position according to the standard entry format;
  • a second calling subunit 23 configured to invoke the word segmentation tool
  • a second word segment sub-unit 24 configured to use the word segmentation tool to perform segmentation on the format-adjusted term containing the to-be-tested related words and sorted before the Mth position, to obtain a second word set;
  • the second extracting sub-unit 25 is configured to obtain, from the second word set, a word belonging to the core word in the user dictionary as a control word, and obtain a control word set.
  • the apparatus for providing related words further includes a normalization module 40:
  • the normalization module is configured to subtract the screening threshold from the relevance of each lower related word in the keyword's lower related word set to the keyword, and to subtract the screening threshold from the relevance of each superordinate related word in the keyword's superordinate related word set to the keyword, completing the normalization of the relevance.
  • By accepting as lower related words only those to-be-tested related words that pass verification, the device for providing related words filters out the influence of noise words and improves the quality of the related words obtained; in other words, it ensures the accuracy of the related words provided to the user.
  • While lower related word sets continue to be obtained, the superordinate related words of the keyword are obtained in reverse, which greatly expands the number of related words provided to the user while maintaining the quality of the superordinate related words.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
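
The lower related word acquisition and the normalization described in the definitions above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: `candidates_of` is a hypothetical stand-in for the whole "fetch entries, adjust format, segment, keep core words" pipeline, and the toy data, function names, and threshold value are invented for the example.

```python
from typing import Callable, Dict, Set


def lower_related_words(
    input_word: str,
    candidates_of: Callable[[str], Set[str]],
    screening_threshold: int,
) -> Dict[str, int]:
    """Map each accepted lower related word to its relevance.

    `candidates_of(word)` is a hypothetical callback that returns the
    screened word set extracted from entries containing `word`.
    """
    to_be_tested = candidates_of(input_word)  # the to-be-tested set A
    lower: Dict[str, int] = {}
    for candidate in to_be_tested:
        control_set = candidates_of(candidate)  # control word set of the candidate
        overlap = len(control_set & to_be_tested)  # size of the intersection
        if overlap > screening_threshold:
            lower[candidate] = overlap  # the intersection size is the relevance
    return lower


def normalize(relevance: Dict[str, int], screening_threshold: int) -> Dict[str, int]:
    """Subtract the screening threshold so that relevance values start from 0."""
    return {word: score - screening_threshold for word, score in relevance.items()}


# Invented toy data standing in for segmented and screened entries.
TOY_CANDIDATES: Dict[str, Set[str]] = {
    "java": {"jvm", "spring", "class", "coffee"},
    "jvm": {"java", "spring", "class", "bytecode"},
    "spring": {"java", "jvm", "class", "bean"},
    "class": {"jvm", "object"},
    "coffee": {"bean", "cup"},
}


def toy_candidates_of(word: str) -> Set[str]:
    return TOY_CANDIDATES.get(word, set())


lower = lower_related_words("java", toy_candidates_of, screening_threshold=1)
normalized = normalize(lower, screening_threshold=1)
```

In this toy data, "coffee" shares no control words with the to-be-tested set and is dropped as a noise word, which is the filtering effect the text attributes to the control word comparison.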

Abstract

A method and a device for providing relevant words, the method comprising: using user-inputted keywords as input words, obtaining from an entry database a word set subordinate to the keywords, and determining the degree of relevance to the keywords of each subordinate word in said subordinate word set (S1); on the basis of the word set subordinate to the keywords, obtaining from the entry database a set of words superordinate to the keywords, and determining the relevance to the keywords of each superordinate word in said superordinate word set (S2); using the combined set of the subordinate word set and the superordinate word set as the relevant word set output with respect to the keywords, then, according to the degree of relevance of each relevant word in the output word set, selecting relevant words from said output word set to provide to the user (S3). Use of the present method makes possible provision of a greater number of, and more precise, relevant words.

Description

Method and apparatus for providing related words
Technical field
The present invention relates to the field of computer technologies, and in particular to a method and apparatus for providing related words.
Background
At present, shopping websites and search engine service websites provide keyword search: the user enters a keyword for the product or technology to be searched, and the server searches on that keyword and returns the corresponding results to the user. To provide accurate search results, the server generally expands the keyword: it finds the related words corresponding to the keyword entered by the user and offers them to the user, so that when a keyword search does not return satisfactory results, the search can be run on the related words instead. However, existing related word expansion relies on existing dictionaries, such as WordNet and 《同义词林》 (a Chinese synonym thesaurus). The related words obtained in this way are quite limited in number and may not keep up with changes in the language, so they cannot meet the timeliness requirements for related words.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for providing related words that can provide more, and more accurate, related words.
A method for providing related words according to an embodiment of the present invention includes:
taking a keyword input by the user as an input word, acquiring a lower related word set of the keyword from the entry database, and determining the relevance of each lower related word in the lower related word set to the keyword;
obtaining, according to the lower related word set of the keyword, the superordinate related word set of the keyword from the entry database, and determining the relevance of each superordinate related word in the superordinate related word set to the keyword;
taking the union of the keyword's lower related word set and superordinate related word set as the keyword's output related word set and, according to the relevance of each output related word in the output related word set, selecting from it the related words provided to the user.
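
The final step, taking the union and selecting by relevance, can be sketched as follows. The source does not say how a word that appears in both sets is scored or how many words are returned, so keeping the larger relevance and returning the `top_k` highest-scoring words are assumptions made for this sketch; the function name and sample values are likewise invented.

```python
from typing import Dict, List


def select_related_words(
    lower: Dict[str, int],
    upper: Dict[str, int],
    top_k: int,
) -> List[str]:
    """Merge both relevance maps and return the top_k words by relevance."""
    merged = dict(lower)
    for word, score in upper.items():
        # Assumption: a word found in both sets keeps its larger relevance.
        merged[word] = max(score, merged.get(word, score))
    # Sort by descending relevance, breaking ties alphabetically for stability.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


# Invented sample relevance values for the keyword "java".
sample_lower = {"jvm": 2, "spring": 1}
sample_upper = {"programming": 3, "jvm": 1}
selected = select_related_words(sample_lower, sample_upper, top_k=2)
```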
As a further improvement of the embodiment of the present invention, obtaining the upper related word set of the keyword from the entry database according to the lower related word set of the keyword, and determining the relevance of each upper related word in the upper related word set to the keyword, specifically comprises:
for each lower related word in the lower related word set, updating the input word to that lower related word, and obtaining the lower related word set of the updated input word from the entry database;
determining whether the total number of lower related word sets is greater than a preset threshold;
if so, selecting, from the lower related word sets, those that contain the keyword, and taking the input word corresponding to each lower related word set that contains the keyword as an upper related word, thereby obtaining the upper related word set of the keyword; wherein the relevance between the keyword, as a member of such a lower related word set, and the input word corresponding to that set is used as the relevance of that input word, as an upper related word, to the keyword;
if not, continuing with the following operation: for each lower related word in the lower related word set of the updated input word, updating the input word again to that lower related word, and obtaining the lower related word set of the newly updated input word from the entry database, until the total number of lower related word sets is greater than the preset threshold.
Further, obtaining a lower related word set from the entry database specifically comprises:
obtaining, according to the input word, the entries containing the input word from the entry database, and performing word segmentation and filtering on the entries to obtain a to-be-verified related word set;
for each to-be-verified related word in the to-be-verified related word set, obtaining the entries containing that to-be-verified related word from the entry database, and performing word segmentation and filtering on those entries to obtain a comparison word set of the to-be-verified related word;
when the cardinality of the intersection of the comparison word set of the to-be-verified related word and the to-be-verified related word set (the number of elements the two sets share) is determined to be greater than a filtering threshold, taking the to-be-verified related word as a lower related word of the input word, thereby obtaining the lower related word set; wherein the cardinality is used as the relevance of the lower related word to the keyword.
As a further improvement of the present invention, obtaining, according to the input word, the entries containing the input word from the entry database, and performing word segmentation and filtering on the entries to obtain the to-be-verified related word set, specifically comprises:
obtaining, according to the input word, the entries that contain the input word and rank before the M-th position from the entry database;
adjusting the format of the obtained entries according to a standard entry format;
invoking a word segmentation tool;
segmenting the format-adjusted entries with the word segmentation tool to obtain a first word set;
extracting, from the first word set, the words that are core words in a user dictionary as to-be-verified related words, thereby obtaining the to-be-verified related word set; wherein the user dictionary is provided by the word segmentation tool;
and, obtaining, according to the to-be-verified related word, the entries containing the to-be-verified related word from the entry database, and performing word segmentation and filtering on those entries to obtain the comparison word set of the to-be-verified related word, specifically comprises:
obtaining, according to the to-be-verified related word, the entries that contain the to-be-verified related word and rank before the M-th position from the entry database;
adjusting, according to the standard entry format, the format of the entries that contain the to-be-verified related word and rank before the M-th position;
invoking the word segmentation tool;
segmenting, with the word segmentation tool, the format-adjusted entries that contain the to-be-verified related word and rank before the M-th position, to obtain a second word set;
extracting, from the second word set, the words that are core words in the user dictionary as comparison words, thereby obtaining the comparison word set.
Specifically, where the intersection of the lower related word set and the upper related word set of the keyword is included in the output related word set of the keyword, the relevance of each output related word in that intersection is T, with T = (T1 + T2)/2; here T1 is the relevance of the output related word to the keyword when it acts as a lower related word, and T2 is its relevance to the keyword when it acts as an upper related word.
As a further improvement of the present invention, the method further comprises:
subtracting the filtering threshold from the relevance of each lower related word in the lower related word set of the keyword to the keyword;
subtracting the filtering threshold from the relevance of each upper related word in the upper related word set of the keyword to the keyword, thereby completing the normalization of the relevance.
Correspondingly, an embodiment of the present invention further provides an apparatus for providing related words, comprising:
a lower related word set module, configured to take a keyword entered by a user as an input word, obtain a lower related word set of the keyword from an entry database, and determine the relevance of each lower related word in the lower related word set to the keyword;
an upper related word set module, configured to obtain, according to the lower related word set of the keyword, an upper related word set of the keyword from the entry database, and determine the relevance of each upper related word in the upper related word set to the keyword;
an output related word set module, configured to take the union of the lower related word set and the upper related word set of the keyword as an output related word set of the keyword, and to select, from the output related word set, the related words to be provided to the user according to the relevance of each output related word in the output related word set.
As a further improvement of the embodiment of the present invention, the upper related word set module specifically comprises a lower word set obtaining unit, a threshold judging unit and an upper word set obtaining unit, wherein:
the lower word set obtaining unit is configured to, for each lower related word in the lower related word set, update the input word to that lower related word and obtain the lower related word set of the updated input word from the entry database;
the threshold judging unit is configured to determine whether the total number of lower related word sets is greater than a preset threshold;
the upper word set obtaining unit is configured to, when the total number of lower related word sets is determined to be greater than the preset threshold, select from the lower related word sets those that contain the keyword, and take the input word corresponding to each lower related word set that contains the keyword as an upper related word, thereby obtaining the upper related word set of the keyword; wherein the relevance between the keyword, as a member of such a lower related word set, and the input word corresponding to that set is used as the relevance of that input word, as an upper related word, to the keyword;
the lower word set obtaining unit is further configured to, when the total number of lower related word sets is determined to be less than the preset threshold, continue with the following operation: for each lower related word in the lower related word set of the updated input word, update the input word again to that lower related word, and obtain the lower related word set of the newly updated input word from the entry database, until the total number of lower related word sets is greater than the preset threshold.
Further, the lower related word set module and the lower word set obtaining unit each further comprise units for obtaining a lower related word set from the entry database, specifically:
a to-be-verified related word set unit, configured to obtain, according to the input word, the entries containing the input word from the entry database, and to perform word segmentation and filtering on the entries to obtain a to-be-verified related word set;
a comparison word set unit, configured to, for each to-be-verified related word in the to-be-verified related word set, obtain the entries containing that to-be-verified related word from the entry database, and perform word segmentation and filtering on those entries to obtain a comparison word set of the to-be-verified related word; and
a judging and obtaining unit, configured to, when the cardinality of the intersection of the comparison word set of the to-be-verified related word and the to-be-verified related word set (the number of elements the two sets share) is determined to be greater than a filtering threshold, take the to-be-verified related word as a lower related word of the input word, thereby obtaining the lower related word set; wherein the cardinality is used as the relevance of the lower related word to the keyword.
Further, the to-be-verified related word set unit specifically comprises:
a first entry subunit, configured to obtain, according to the input word, the entries that contain the input word and rank before the M-th position from the entry database;
a first adjustment subunit, configured to adjust the format of the obtained entries according to a standard entry format;
a first invoking subunit, configured to invoke a word segmentation tool;
a first segmentation subunit, configured to segment the format-adjusted entries with the word segmentation tool to obtain a first word set; and
a first extraction subunit, configured to extract, from the first word set, the words that are core words in a user dictionary as to-be-verified related words, thereby obtaining the to-be-verified related word set; wherein the user dictionary is provided by the word segmentation tool;
and, the comparison word set unit specifically comprises:
a second entry subunit, configured to obtain, according to the to-be-verified related word, the entries that contain the to-be-verified related word and rank before the M-th position from the entry database;
a second adjustment subunit, configured to adjust, according to the standard entry format, the format of the entries that contain the to-be-verified related word and rank before the M-th position;
a second invoking subunit, configured to invoke the word segmentation tool;
a second segmentation subunit, configured to segment, with the word segmentation tool, the format-adjusted entries that contain the to-be-verified related word and rank before the M-th position, to obtain a second word set; and
a second extraction subunit, configured to extract, from the second word set, the words that are core words in the user dictionary as comparison words, thereby obtaining the comparison word set.
Further, the apparatus for providing related words also comprises a normalization module:
the normalization module is configured to subtract the filtering threshold from the relevance of each lower related word in the lower related word set of the keyword to the keyword, and to subtract the filtering threshold from the relevance of each upper related word in the upper related word set of the keyword to the keyword, thereby completing the normalization of the relevance.
Implementing the embodiments of the present invention has the following beneficial effects:
The method and apparatus for providing related words according to the embodiments of the present invention obtain the lower related word set of a keyword supplied by the user from an entry database, then derive the upper related word set of the keyword from that lower related word set, and finally take the union of the lower related word set and the upper related word set as the output related word set of the keyword. This expands the keyword into a large number of related words for the user to choose from. In addition, determining the relevance of each related word gives an accurate measure of how closely the related word is related to the keyword; the related words to be provided to the user can then be selected according to that relevance, so that related words are provided to the user accurately.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an embodiment of the method for providing related words provided by the present invention;
FIG. 2 is a schematic flowchart of an embodiment of step S2 of the method for providing related words shown in FIG. 1;
FIG. 3 is a schematic flowchart of an embodiment of step S3 of the method for providing related words shown in FIG. 1;
FIG. 4 is a schematic flowchart of another embodiment of the method for providing related words provided by the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of the apparatus for providing related words provided by the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of the upper related word set module of the apparatus for providing related words provided by the present invention;
FIG. 7 is a schematic structural diagram of an embodiment of the units for obtaining a lower related word set in the apparatus for providing related words provided by the present invention;
FIG. 8 is a schematic structural diagram of an embodiment of the to-be-verified related word set unit of the apparatus for providing related words provided by the present invention;
FIG. 9 is a schematic structural diagram of an embodiment of the comparison word set unit of the apparatus for providing related words provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to FIG. 1, FIG. 2 and FIG. 3: FIG. 1 is a schematic flowchart of an embodiment of the method for providing related words provided by the present invention, FIG. 2 is a schematic flowchart of an embodiment of step S2 of the method shown in FIG. 1, and FIG. 3 is a schematic flowchart of an embodiment of step S3 of the method shown in FIG. 1. The method of this embodiment is described in detail below with reference to these three flowcharts, taking a paper database (for example 中国知网, CNKI) as the entry database and obtaining the related words of the keyword Java as an example. The method comprises the following steps:
S1: taking the keyword Java entered by the user as the input word, obtain the lower related word set of the keyword Java from the entry database, and determine the relevance of each lower related word in the lower related word set to the keyword. Step S1 comprises steps S11 to S13, as follows:
S11: according to the input word Java, obtain the entries containing the input word Java from the paper database, and perform word segmentation and filtering on the entries to obtain the to-be-verified related word set A = {a_1, …, a_n}. This step is carried out as follows:
using a search engine, obtain from the paper database the entries that contain the input word Java and rank before the M-th position; for example, the paper abstracts on the first 50 result pages are taken as entries, or the first 500 abstracts returned by searching a wiki for the keyword Java are taken as entries;
adjust the format of the entries according to a standard entry format; for example, convert lowercase letters in the entries to uppercase, remove redundant spaces, unify the punctuation marks, and unify full-width and half-width characters into a single form;
invoke a word segmentation tool; preferably the word segmentation tool is the jieba word segmentation tool, although other tools may be used;
segment the format-adjusted entries with the word segmentation tool to obtain a first word set;
according to a keyword extraction algorithm, extract from the first word set the words related to the input word as to-be-verified related words {a_1, …, a_n}, obtaining the to-be-verified related word set A = {a_1, …, a_n}. It should be noted that a dictionary may be added through the word segmentation tool or through the present apparatus for providing related words, and the core words supplied by that dictionary are used to extract core words from the first word set as to-be-verified related words.
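The sub-steps of S11 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: `CORE_WORDS`, `normalize_entry`, `segment` and `candidate_words` are hypothetical names, the regular-expression tokenizer is a stand-in for a real word segmentation tool such as jieba, Unicode NFKC normalization is one possible way to unify full-width and half-width characters, and excluding the input word from its own candidates is an assumption the text does not state.

```python
import re
import unicodedata

# Hypothetical core-word dictionary. In the text this role is played by a
# user dictionary supplied through the word segmentation tool.
CORE_WORDS = {"JAVA", "JVM", "SPRING", "SERVLET"}

def normalize_entry(entry):
    """Bring an entry into one assumed 'standard entry format'."""
    text = unicodedata.normalize("NFKC", entry)  # full-width -> half-width
    text = text.upper()                          # unify letter case
    return re.sub(r"\s+", " ", text).strip()     # drop redundant spaces

def segment(text):
    """Stand-in tokenizer: split on runs of non-word characters."""
    return [token for token in re.split(r"\W+", text) if token]

def candidate_words(entries, input_word, core_words=CORE_WORDS):
    """Core words found in the entries, excluding the input word (assumption)."""
    input_word = normalize_entry(input_word)
    found = set()
    for entry in entries:
        for token in segment(normalize_entry(entry)):
            if token in core_words and token != input_word:
                found.add(token)
    return found

entries = ["Ｊａｖａ  runs on the JVM，", "Spring is a java framework"]
print(sorted(candidate_words(entries, "Java")))  # ['JVM', 'SPRING']
```

For Chinese text the stand-in tokenizer would not produce useful words; a real segmenter would be needed in its place.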
S12: for each to-be-verified related word in the to-be-verified related word set A = {a_1, …, a_n}, obtain the entries containing that to-be-verified related word from the entry database, and perform word segmentation and filtering on those entries to obtain the comparison word set of the to-be-verified related word. It should be noted that step S12 is carried out in the same way as the preceding step S11; the only difference is that the input word of step S11 is replaced by a to-be-verified related word from {a_1, …, a_n}, and the to-be-verified related word set B_ai = {b_i1, …, b_in} obtained in this way for the to-be-verified related word a_i serves as the comparison word set of a_i. The details are therefore not repeated here.
S13: when the cardinality r of the intersection of the comparison word set B_ai = {b_i1, …, b_in} of the to-be-verified related word a_i and the to-be-verified related word set A = {a_1, …, a_n} is determined to be greater than the filtering threshold p, that is, when the number of elements shared by B_ai and A is greater than p, the to-be-verified related word a_i is a lower related word of the input word Java. This yields the lower related word set of the keyword, A′ = {a_j}, where j ∈ {1, …, n}, |A′| ≤ n and |A ∩ B_aj| > p. The cardinality r of the intersection is the relevance of the lower related word within the lower related word set. It should be noted that the relevance expresses the degree of correlation between a related word in a related word set and the input word of that set.
Obtaining the lower related word set of the input word through steps S11, S12 and S13 filters out noise words and improves the efficiency of obtaining lower related words.
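Steps S11 to S13 taken together can be sketched as below, as an illustration rather than the definitive implementation. To keep the sketch self-contained it assumes the entries have already been segmented and reduced to sets of core words; `ENTRIES`, `candidates` and `lower_related` are hypothetical names, and `top_m` plays the role of M.

```python
# Toy entry database: each entry is already reduced to its set of core words.
ENTRIES = [
    {"JAVA", "JVM", "SPRING"},
    {"JAVA", "JVM", "SERVLET"},
    {"JAVA", "SPRING", "SERVLET"},
    {"JAVA", "COFFEE"},
    {"COFFEE", "TEA"},
]

def candidates(entries, input_word, top_m=50):
    """Words co-occurring with input_word in the first top_m matching entries."""
    hits = [entry for entry in entries if input_word in entry][:top_m]
    return {word for entry in hits for word in entry if word != input_word}

def lower_related(entries, input_word, p, top_m=50):
    """Return {lower related word: relevance}, with relevance r = |A ∩ B_ai|."""
    a_set = candidates(entries, input_word, top_m)   # S11: set A
    result = {}
    for a_i in a_set:
        b_set = candidates(entries, a_i, top_m)      # S12: comparison set B_ai
        r = len(a_set & b_set)                       # S13: shared elements
        if r > p:                                    # keep only if r > p
            result[a_i] = r
    return result

print(sorted(lower_related(ENTRIES, "JAVA", p=1).items()))
# [('JVM', 2), ('SERVLET', 2), ('SPRING', 2)]
```

In this toy data COFFEE co-occurs with JAVA once, but its comparison set shares no element with A, so it is dropped as a noise word.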
S2: according to the lower related word set A′ = {a_j} of the keyword, obtain the upper related word set of the keyword from the entry database, and determine the relevance of each upper related word in the upper related word set to the keyword. This step specifically comprises the following steps S21 to S24:
S21: for each lower related word in the lower related word set A′ = {a_j}, update the input word to that lower related word a_j, that is, use a_j as the input word, and obtain the lower related word set A″ of the updated input word a_j from the paper database. It should be noted that, in this embodiment, the lower related word set in step S21 is preferably obtained in the same way as in step S1 above, and the details are not repeated here.
S22: determine whether the total number N of lower related word sets obtained so far is greater than a preset threshold S;
S23: if so, select from the lower related word sets all those that contain the keyword, and take the input word corresponding to each lower related word set that contains the keyword as an upper related word, thereby obtaining the upper related word set C of the keyword; wherein the relevance between the keyword, as a member of such a lower related word set, and the input word corresponding to that set is used as the relevance of that input word, as an upper related word, to the keyword;
S24: if not, continue with the following operation: for each lower related word in the lower related word set of the updated input word, update the input word again to that lower related word, and obtain the lower related word set of the newly updated input word from the entry database, until the total number N of lower related word sets is greater than the preset threshold S. It should be noted that the lower related word sets in steps S21 and S24 are obtained in the same way as in step S1 above, and the details are not repeated here.
That is, taking the keyword Java as an example, the upper related word set of Java is a set whose elements are input words: the lower related word set of every element in the set contains the same element, Java. By deriving the upper related word set of the keyword in reverse, using the same procedure as for obtaining lower related word sets, related words can be provided to the user along multiple dimensions.
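The reverse derivation in steps S21 to S24 can be sketched as follows. This is a hedged sketch: `upper_related`, `lower_fn` and the toy table `LOWER` are hypothetical, `lower_fn(word)` stands for whatever procedure returns the lower related word set of a word together with its relevance values, and two details are assumptions not stated in the text: the keyword itself is never used as an input word, and the loop also stops when no new input words remain, so that it terminates on a finite database.

```python
# Toy table: input word -> {lower related word: relevance}.
LOWER = {
    "JAVA": {"JVM": 3, "SPRING": 2},
    "JVM": {"JAVA": 4, "BYTECODE": 2},
    "SPRING": {"JAVA": 5, "BEAN": 1},
    "BYTECODE": {"JVM": 2},
    "BEAN": {"SPRING": 1, "JAVA": 2},
}

def upper_related(keyword, lower_fn, s):
    """Return {upper related word: relevance to keyword}."""
    sets_by_input = {}                   # input word -> its lower related word set
    frontier = set(lower_fn(keyword))    # start from the keyword's lower related words
    while frontier:
        next_frontier = set()
        for word in frontier:            # S21 / S24: use each word as the input word
            if word == keyword or word in sets_by_input:
                continue
            sets_by_input[word] = lower_fn(word)
            next_frontier.update(sets_by_input[word])
        if len(sets_by_input) > s:       # S22: total number of sets N > S
            break
        # Assumed guard: expand only words not yet used as input words.
        frontier = next_frontier - set(sets_by_input) - {keyword}
    # S23: input words whose lower related word set contains the keyword.
    return {word: rel[keyword] for word, rel in sets_by_input.items() if keyword in rel}

def lookup(word):
    return LOWER.get(word, {})

print(sorted(upper_related("JAVA", lookup, s=1).items()))  # [('JVM', 4), ('SPRING', 5)]
print(sorted(upper_related("JAVA", lookup, s=3).items()))
# [('BEAN', 2), ('JVM', 4), ('SPRING', 5)]
```

With the larger threshold the search runs one more round, which is how BEAN, a word that is not itself a lower related word of Java, enters the upper related word set.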
S3: take the union of the lower related word set and the upper related word set of the keyword as the output related word set of the keyword, and select, from the output related word set, the related words to be provided to the user according to the relevance of each output related word in the output related word set.
Specifically, where the intersection of the lower related word set and the upper related word set of the keyword is included in the output related word set of the keyword, the relevance of each output related word in that intersection is T, with T = (T1 + T2)/2; here T1 is the relevance of the output related word to the keyword when it acts as a lower related word, and T2 is its relevance to the keyword when it acts as an upper related word. In other words, after the union is taken, a related word that appears in both the upper related word set and the lower related word set receives the mean of its relevance values in the two sets.
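Step S3 can be sketched as below. `merge_related` and `top_related` are hypothetical names; the averaging follows T = (T1 + T2)/2 from the text, while selecting the k highest-scoring words (with ties broken alphabetically) is only one assumed way of choosing related words "according to relevance".

```python
def merge_related(lower, upper):
    """Union of two {word: relevance} maps; shared words get the mean (T1 + T2) / 2."""
    merged = {}
    for word in lower.keys() | upper.keys():
        if word in lower and word in upper:
            merged[word] = (lower[word] + upper[word]) / 2
        else:
            merged[word] = lower.get(word, upper.get(word))
    return merged

def top_related(merged, k):
    """Assumed selection rule: the k words with the highest relevance."""
    return sorted(merged, key=lambda word: (-merged[word], word))[:k]

lower = {"JVM": 3, "SPRING": 2}
upper = {"JVM": 4, "BEAN": 2}
merged = merge_related(lower, upper)
print(merged["JVM"])           # 3.5
print(top_related(merged, 2))  # ['JVM', 'BEAN']
```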
As a further improvement of the present invention, the method further comprises normalizing the relevance:
subtracting the filtering threshold from the relevance of each lower related word in the lower related word set of the keyword to the keyword;
subtracting the filtering threshold from the relevance of each upper related word in the upper related word set of the keyword to the keyword, thereby completing the normalization of the relevance.
It should be noted that the purpose of the normalization is to make the relevance values of the related words in the output related word set of the keyword take 0 as their baseline: the higher the value, the more closely the related word is related to the keyword. This makes it convenient to select the related words to be provided to the user from the output related word set in step S3.
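A minimal sketch of this normalization, assuming relevance values are held in a `{word: relevance}` map (the name `normalize_relevance` is hypothetical):

```python
def normalize_relevance(related, p):
    """Subtract the filtering threshold p from every relevance value."""
    return {word: score - p for word, score in related.items()}

lower = normalize_relevance({"JVM": 3, "SPRING": 2}, p=1)
upper = normalize_relevance({"JVM": 4, "BEAN": 2}, p=1)
print(lower)  # {'JVM': 2, 'SPRING': 1}
print(upper)  # {'JVM': 3, 'BEAN': 1}
```

Because every retained relevance value exceeds p, the normalized values are all positive. For a word in both sets, averaging after normalization gives (T1 - p + T2 - p)/2 = (T1 + T2)/2 - p, so normalizing before or after averaging yields the same result.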
With the method for providing related words of this embodiment, only the to-be-verified related words that pass comparison verification are taken as lower related words, which filters out the influence of noise words and improves the quality of the related words obtained; in other words, it ensures the accuracy of the related words provided to the user. In addition, after the lower related word set of the keyword is obtained, the upper related words of the keyword are derived in reverse from the lower related word sets, which greatly increases the number of related words provided to the user while ensuring the quality of the upper related words.
Referring to FIG. 4, which is a schematic flowchart of another embodiment of the method for providing related words provided by the present invention. In this embodiment, a paper database and the Wikipedia database are each used as the entry database; a first output related word set and a second output related word set are obtained from them respectively, and the union of the first output related word set and the second output related word set is taken as the final output related word set of the keyword. The second output related word set is obtained from the Wikipedia database in the same way as the output related word set is obtained from the paper database in the previous embodiment. By mining related words from two different entry databases, namely a paper database and the Wikipedia database, this embodiment keeps the expansion of related words well targeted and avoids the one-sided related words that a single corpus would produce.
Correspondingly, refer to FIG. 5, which is a schematic structural diagram of an embodiment of the apparatus for providing related words according to the present invention. The apparatus can implement the entire flow of both of the foregoing embodiments and includes:
a lower-level related word set module 10, configured to take a keyword input by a user as an input word, obtain a lower-level related word set of the keyword from an entry database, and determine the relevance between each lower-level related word in the lower-level related word set and the keyword;
an upper-level related word set module 20, configured to obtain an upper-level related word set of the keyword from the entry database according to the lower-level related word set of the keyword, and determine the relevance between each upper-level related word in the upper-level related word set and the keyword; and
an output related word set module 30, configured to take the union of the lower-level related word set and the upper-level related word set of the keyword as an output related word set of the keyword, and to select, from the output related word set, the related words to be provided to the user according to the relevance of each output related word in the output related word set.
As a further improvement of the embodiment of the present invention, FIG. 6 is a schematic structural diagram of an embodiment of the upper-level related word set module of the apparatus for providing related words according to the present invention. The upper-level related word set module 30 specifically includes a lower-level word set obtaining unit 31, a threshold judging unit 32, and an upper-level word set obtaining unit 33, wherein:
the lower-level word set obtaining unit 31 is configured to, for each lower-level related word in the lower-level related word set, update the input word to that lower-level related word and obtain the lower-level related word set of the updated input word from the entry database;
the threshold judging unit 32 is configured to judge whether the total number of lower-level related word sets is greater than a preset threshold;
the upper-level word set obtaining unit 33 is configured to, when the total number of lower-level related word sets is judged to be greater than the preset threshold, select from the lower-level related word sets those that contain the keyword, and take the input word corresponding to each lower-level related word set that contains the keyword as an upper-level related word, thereby obtaining the upper-level related word set of the keyword; wherein the relevance between the keyword, as it appears in a lower-level related word set containing it, and the input word corresponding to that set is taken as the relevance between that input word, as an upper-level related word, and the keyword;
the lower-level word set obtaining unit 31 is further configured to, when the total number of lower-level related word sets is judged to be less than the preset threshold, continue with the following operation: for each lower-level related word in the lower-level related word set of the updated input word, update the input word again to that lower-level related word and obtain the lower-level related word set of the newly updated input word from the entry database, until the total number of lower-level related word sets is greater than the preset threshold.
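The cooperation of units 31, 32 and 33 can be sketched as a level-by-level expansion followed by a reverse lookup. This is a minimal sketch under stated assumptions: `get_lower_set` is a hypothetical callable that returns the lower-level related word set of an input word as a mapping from word to relevance, and the sketch additionally stops when no unexplored words remain, a termination condition the text does not mention.

```python
from typing import Callable, Dict, Iterable, List


def upper_related_words(
    keyword: str,
    keyword_lower_words: Iterable[str],
    get_lower_set: Callable[[str], Dict[str, float]],
    preset_threshold: int,
) -> Dict[str, float]:
    """Return {upper-level related word: relevance to the keyword}."""
    # input word -> lower-level related word set obtained for that input word
    lower_sets: Dict[str, Dict[str, float]] = {}
    level: List[str] = list(keyword_lower_words)
    # Expand one level at a time until more than `preset_threshold` sets exist.
    # Assumption: also stop once there is nothing left to explore.
    while level and len(lower_sets) <= preset_threshold:
        next_level: List[str] = []
        for input_word in level:
            if input_word == keyword or input_word in lower_sets:
                continue  # assumption: skip the keyword itself and repeated words
            related = get_lower_set(input_word)
            lower_sets[input_word] = related
            next_level.extend(related)
        level = next_level
    # Reverse lookup: an input word whose set contains the keyword is an
    # upper-level related word; the keyword's relevance in that set is reused.
    return {
        input_word: related[keyword]
        for input_word, related in lower_sets.items()
        if keyword in related
    }
```

Checking the threshold once per level, rather than after every lookup, follows the order of operations described for units 31 and 32.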
Further, the lower-level related word set module 20 and the lower-level word set obtaining unit 31 each further include a unit for obtaining a lower-level related word set from the entry database. FIG. 7 is a schematic structural diagram of an embodiment of this unit in the apparatus for providing related words according to the present invention; the unit specifically includes:
a to-be-verified related word set unit 1, configured to obtain, according to the input word, entries containing the input word from the entry database, and to perform word segmentation and screening on the entries to obtain a to-be-verified related word set;
a comparison word set unit 2, configured to, for each to-be-verified related word in the to-be-verified related word set, obtain entries containing the to-be-verified related word from the entry database, and to perform word segmentation and screening on those entries to obtain a comparison word set of the to-be-verified related word; and
a judging and obtaining unit 3, configured to, when the absolute value (that is, the number of elements) of the intersection of the comparison word set of the to-be-verified related word and the to-be-verified related word set is judged to be greater than a screening threshold, determine that the to-be-verified related word is a lower-level related word of the input word, thereby obtaining the lower-level related word set; wherein the absolute value is taken as the relevance between the lower-level related word and the keyword.
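The verification performed by the judging and obtaining unit can be sketched as a set-intersection test. This is a minimal sketch, assuming the comparison word sets have already been built and are passed in as a mapping; the names `verify_candidates`, `candidates` and `comparison_sets` are hypothetical.

```python
from typing import Dict, Set


def verify_candidates(
    candidates: Set[str],
    comparison_sets: Dict[str, Set[str]],
    screening_threshold: int,
) -> Dict[str, int]:
    """Return {lower-level related word: relevance} for candidates that pass."""
    lower_related: Dict[str, int] = {}
    for candidate in candidates:
        # Size of the intersection between the candidate's comparison word set
        # and the whole to-be-verified related word set.
        overlap = len(comparison_sets.get(candidate, set()) & candidates)
        if overlap > screening_threshold:
            lower_related[candidate] = overlap  # the overlap size is the relevance
    return lower_related
```

A noise word that merely co-occurs with the input word tends to share few words with the rest of the candidate set, so its overlap stays at or below the threshold and it is dropped.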
Further, FIG. 8 is a schematic structural diagram of an embodiment of the to-be-verified related word set unit of the apparatus for providing related words according to the present invention. The to-be-verified related word set unit 1 specifically includes:
a first entry subunit 11, configured to obtain, according to the input word, the entries in the entry database that contain the input word and are ranked before the M-th position;
a first adjustment subunit 12, configured to adjust the format of the obtained entries according to a standard entry format;
a first calling subunit 13, configured to call a word segmentation tool;
a first segmentation subunit 14, configured to segment the format-adjusted entries by using the word segmentation tool to obtain a first word set; and
a first extraction subunit 15, configured to extract, from the first word set, the words that belong to the core words of a user dictionary as to-be-verified related words, thereby obtaining the to-be-verified related word set; wherein the user dictionary is provided by the word segmentation tool;
and FIG. 9 is a schematic structural diagram of an embodiment of the comparison word set unit of the apparatus for providing related words according to the present invention. The comparison word set unit 2 specifically includes:
a second entry subunit 21, configured to obtain, according to the to-be-verified related word, the entries in the entry database that contain the to-be-verified related word and are ranked before the M-th position;
a second adjustment subunit 22, configured to adjust, according to the standard entry format, the format of the entries that contain the to-be-verified related word and are ranked before the M-th position;
a second calling subunit 23, configured to call the word segmentation tool;
a second segmentation subunit 24, configured to segment, by using the word segmentation tool, the format-adjusted entries that contain the to-be-verified related word and are ranked before the M-th position, to obtain a second word set; and
a second extraction subunit 25, configured to extract, from the second word set, the words that belong to the core words of the user dictionary as comparison words, thereby obtaining the comparison word set.
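Both the to-be-verified related word set unit and the comparison word set unit follow the same pipeline: retrieve the top-ranked entries, normalize their format, segment them, and keep only core words. The sketch below illustrates that pipeline under loud assumptions: the entry database is modeled as a list of strings already in rank order, "ranked before the M-th position" is read as the first `m` matching entries, lowercasing and whitespace cleanup stand in for the standard entry format, whitespace splitting stands in for the word segmentation tool, and `core_words` stands in for the core words of the user dictionary. Dropping the query word from its own result is also an assumption.

```python
import re
from typing import List, Set


def extract_word_set(word: str, ranked_entries: List[str], core_words: Set[str], m: int) -> Set[str]:
    """Build a to-be-verified (or comparison) word set for `word`."""
    word = word.lower()
    # Entry subunit: matching entries, limited to the first m (assumed reading).
    hits = [entry for entry in ranked_entries if word in entry.lower()][:m]
    result: Set[str] = set()
    for entry in hits:
        # Adjustment subunit: stand-in for the "standard entry format".
        normalized = re.sub(r"\s+", " ", entry.strip().lower())
        # Segmentation subunit: whitespace split as a stand-in segmenter.
        for token in normalized.split(" "):
            token = token.strip(".,;:()")
            # Extraction subunit: keep core words only.
            if token in core_words and token != word:
                result.add(token)
    return result
```

Calling the function once with the input word yields the to-be-verified related word set; calling it once per to-be-verified related word yields the comparison word sets.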
Specifically, the intersection of the lower-level related word set and the upper-level related word set of the keyword is contained in the output related word set of the keyword, and the relevance of each output related word in the intersection is T, where T = (T1 + T2)/2; T1 is the relevance between the output related word and the keyword when the output related word serves as a lower-level related word, and T2 is the relevance between the output related word and the keyword when the output related word serves as an upper-level related word.
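The averaging rule T = (T1 + T2)/2 and the subsequent selection can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and selecting the `k` most relevant words is an assumed selection policy, since the text only states that words are selected according to relevance.

```python
from typing import Dict, List


def combine_related_sets(lower: Dict[str, float], upper: Dict[str, float]) -> Dict[str, float]:
    """Union of the lower-level and upper-level sets (word -> relevance)."""
    output: Dict[str, float] = {}
    for word in lower.keys() | upper.keys():
        if word in lower and word in upper:
            output[word] = (lower[word] + upper[word]) / 2  # T = (T1 + T2) / 2
        elif word in lower:
            output[word] = lower[word]
        else:
            output[word] = upper[word]
    return output


def select_top(output: Dict[str, float], k: int) -> List[str]:
    """Assumed policy: the k most relevant words, ties broken alphabetically."""
    return sorted(output, key=lambda w: (-output[w], w))[:k]
```

Averaging keeps a word that is both a lower-level and an upper-level related word from being counted twice when the two sets are merged.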
Further, as shown in FIG. 5, the apparatus for providing related words further includes a normalization module 40:
the normalization module is configured to subtract the screening threshold from the relevance between each lower-level related word in the lower-level related word set of the keyword and the keyword, and to subtract the screening threshold from the relevance between each upper-level related word in the upper-level related word set of the keyword and the keyword, thereby completing the normalization of the relevance.
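The normalization described above is a simple offset, sketched here with a hypothetical function name. Because a word is accepted only when its relevance exceeds the screening threshold, the normalized values produced this way are positive.

```python
from typing import Dict


def normalize_relevance(scores: Dict[str, float], screening_threshold: float) -> Dict[str, float]:
    """Subtract the screening threshold from every relevance value."""
    return {word: score - screening_threshold for word, score in scores.items()}
```

The same function would be applied once to the lower-level related word set and once to the upper-level related word set.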
In the apparatus for providing related words according to the embodiments of the present invention, each to-be-verified lower-level related word is checked against a comparison word set, and only the words that pass this verification are kept as lower-level related words. This filters out noise words and improves the quality of the related words obtained; that is, it ensures the accuracy of the related words provided to the user. In addition, once the lower-level related word set of the keyword has been obtained, the upper-level related words of the keyword are derived in reverse from the lower-level related word sets, which greatly increases the number of related words that can be provided to the user while ensuring the quality of the upper-level related words.
A person of ordinary skill in the art will understand that all or part of the flows of the foregoing method embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements are also deemed to fall within the scope of protection of the present invention.

Claims (11)

  1. A method for providing related words, characterized by comprising:
    taking a keyword input by a user as an input word, obtaining a lower-level related word set of the keyword from an entry database, and determining the relevance between each lower-level related word in the lower-level related word set and the keyword;
    obtaining an upper-level related word set of the keyword from the entry database according to the lower-level related word set of the keyword, and determining the relevance between each upper-level related word in the upper-level related word set and the keyword;
    taking the union of the lower-level related word set and the upper-level related word set of the keyword as an output related word set of the keyword, and selecting, from the output related word set, the related words to be provided to the user according to the relevance of each output related word in the output related word set.
  2. The method for providing related words according to claim 1, characterized in that obtaining the upper-level related word set of the keyword from the entry database according to the lower-level related word set of the keyword, and determining the relevance between each upper-level related word in the upper-level related word set and the keyword, specifically comprises:
    for each lower-level related word in the lower-level related word set, updating the input word to that lower-level related word, and obtaining the lower-level related word set of the updated input word from the entry database;
    judging whether the total number of lower-level related word sets is greater than a preset threshold;
    if so, selecting from the lower-level related word sets those that contain the keyword, and taking the input word corresponding to each lower-level related word set that contains the keyword as an upper-level related word, thereby obtaining the upper-level related word set of the keyword; wherein the relevance between the keyword, as it appears in a lower-level related word set containing it, and the input word corresponding to that set is taken as the relevance between that input word, as an upper-level related word, and the keyword;
    if not, continuing with the following operation: for each lower-level related word in the lower-level related word set of the updated input word, updating the input word again to that lower-level related word, and obtaining the lower-level related word set of the newly updated input word from the entry database, until the total number of lower-level related word sets is greater than the preset threshold.
  3. The method for providing related words according to claim 2, characterized in that obtaining a lower-level related word set from the entry database specifically comprises:
    obtaining, according to the input word, entries containing the input word from the entry database, and performing word segmentation and screening on the entries to obtain a to-be-verified related word set;
    for each to-be-verified related word in the to-be-verified related word set, obtaining entries containing the to-be-verified related word from the entry database, and performing word segmentation and screening on those entries to obtain a comparison word set of the to-be-verified related word;
    when the absolute value of the intersection of the comparison word set of the to-be-verified related word and the to-be-verified related word set is judged to be greater than a screening threshold, determining that the to-be-verified related word is a lower-level related word of the input word, thereby obtaining the lower-level related word set; wherein the absolute value is taken as the relevance between the lower-level related word and the keyword.
  4. The method for providing related words according to claim 3, characterized in that obtaining, according to the input word, entries containing the input word from the entry database, and performing word segmentation and screening on the entries to obtain the to-be-verified related word set, specifically comprises:
    obtaining, according to the input word, the entries in the entry database that contain the input word and are ranked before the M-th position;
    adjusting the format of the obtained entries according to a standard entry format;
    calling a word segmentation tool;
    segmenting the format-adjusted entries by using the word segmentation tool to obtain a first word set;
    extracting, from the first word set, the words that belong to the core words of a user dictionary as to-be-verified related words, thereby obtaining the to-be-verified related word set; wherein the user dictionary is provided by the word segmentation tool;
    and obtaining, according to the to-be-verified related word, entries containing the to-be-verified related word from the entry database, and performing word segmentation and screening on those entries to obtain the comparison word set of the to-be-verified related word, specifically comprises:
    obtaining, according to the to-be-verified related word, the entries in the entry database that contain the to-be-verified related word and are ranked before the M-th position;
    adjusting, according to the standard entry format, the format of the entries that contain the to-be-verified related word and are ranked before the M-th position;
    calling the word segmentation tool;
    segmenting, by using the word segmentation tool, the format-adjusted entries that contain the to-be-verified related word and are ranked before the M-th position, to obtain a second word set;
    extracting, from the second word set, the words that belong to the core words of the user dictionary as comparison words, thereby obtaining the comparison word set.
  5. The method for providing related words according to claim 1, characterized in that the intersection of the lower-level related word set and the upper-level related word set of the keyword is contained in the output related word set of the keyword, and the relevance of each output related word in the intersection is T, where T = (T1 + T2)/2; T1 is the relevance between the output related word and the keyword when the output related word serves as a lower-level related word, and T2 is the relevance between the output related word and the keyword when the output related word serves as an upper-level related word.
  6. The method for providing related words according to claim 3, characterized in that the method further comprises:
    subtracting the screening threshold from the relevance between each lower-level related word in the lower-level related word set of the keyword and the keyword;
    subtracting the screening threshold from the relevance between each upper-level related word in the upper-level related word set of the keyword and the keyword, thereby completing the normalization of the relevance.
  7. An apparatus for providing related words, characterized by comprising:
    a lower-level related word set module, configured to take a keyword input by a user as an input word, obtain a lower-level related word set of the keyword from an entry database, and determine the relevance between each lower-level related word in the lower-level related word set and the keyword;
    an upper-level related word set module, configured to obtain an upper-level related word set of the keyword from the entry database according to the lower-level related word set of the keyword, and determine the relevance between each upper-level related word in the upper-level related word set and the keyword;
    an output related word set module, configured to take the union of the lower-level related word set and the upper-level related word set of the keyword as an output related word set of the keyword, and to select, from the output related word set, the related words to be provided to the user according to the relevance of each output related word in the output related word set.
  8. The apparatus for providing related words according to claim 7, characterized in that the upper-level related word set module specifically comprises a lower-level word set obtaining unit, a threshold judging unit, and an upper-level word set obtaining unit, wherein:
    the lower-level word set obtaining unit is configured to, for each lower-level related word in the lower-level related word set, update the input word to that lower-level related word and obtain the lower-level related word set of the updated input word from the entry database;
    the threshold judging unit is configured to judge whether the total number of lower-level related word sets is greater than a preset threshold;
    the upper-level word set obtaining unit is configured to, when the total number of lower-level related word sets is judged to be greater than the preset threshold, select from the lower-level related word sets those that contain the keyword, and take the input word corresponding to each lower-level related word set that contains the keyword as an upper-level related word, thereby obtaining the upper-level related word set of the keyword; wherein the relevance between the keyword, as it appears in a lower-level related word set containing it, and the input word corresponding to that set is taken as the relevance between that input word, as an upper-level related word, and the keyword;
    the lower-level word set obtaining unit is further configured to, when the total number of lower-level related word sets is judged to be less than the preset threshold, continue with the following operation: for each lower-level related word in the lower-level related word set of the updated input word, update the input word again to that lower-level related word and obtain the lower-level related word set of the newly updated input word from the entry database, until the total number of lower-level related word sets is greater than the preset threshold.
  9. The apparatus for providing related words according to claim 8, characterized in that the lower-level related word set module and the lower-level word set obtaining unit each further comprise a unit for obtaining a lower-level related word set from the entry database, which specifically comprises:
    a to-be-verified related word set unit, configured to obtain, according to the input word, entries containing the input word from the entry database, and to perform word segmentation and screening on the entries to obtain a to-be-verified related word set;
    a comparison word set unit, configured to, for each to-be-verified related word in the to-be-verified related word set, obtain entries containing the to-be-verified related word from the entry database, and to perform word segmentation and screening on those entries to obtain a comparison word set of the to-be-verified related word; and
    a judging and obtaining unit, configured to, when the absolute value of the intersection of the comparison word set of the to-be-verified related word and the to-be-verified related word set is judged to be greater than a screening threshold, determine that the to-be-verified related word is a lower-level related word of the input word, thereby obtaining the lower-level related word set; wherein the absolute value is taken as the relevance between the lower-level related word and the keyword.
  10. The apparatus for providing related words according to claim 9, characterized in that the to-be-verified related word set unit specifically comprises:
    a first entry subunit, configured to obtain, according to the input word, the entries in the entry database that contain the input word and are ranked before the M-th position;
    a first adjustment subunit, configured to adjust the format of the obtained entries according to a standard entry format;
    a first calling subunit, configured to call a word segmentation tool;
    a first segmentation subunit, configured to segment the format-adjusted entries by using the word segmentation tool to obtain a first word set; and
    a first extraction subunit, configured to extract, from the first word set, the words that belong to the core words of a user dictionary as to-be-verified related words, thereby obtaining the to-be-verified related word set; wherein the user dictionary is provided by the word segmentation tool;
    and the comparison word set unit specifically comprises:
    a second entry subunit, configured to obtain, according to the to-be-verified related word, the entries in the entry database that contain the to-be-verified related word and are ranked before the M-th position;
    a second adjustment subunit, configured to adjust, according to the standard entry format, the format of the entries that contain the to-be-verified related word and are ranked before the M-th position;
    a second calling subunit, configured to call the word segmentation tool;
    a second segmentation subunit, configured to segment, by using the word segmentation tool, the format-adjusted entries that contain the to-be-verified related word and are ranked before the M-th position, to obtain a second word set; and
    a second extraction subunit, configured to extract, from the second word set, the words that belong to the core words of the user dictionary as comparison words, thereby obtaining the comparison word set.
  11. The apparatus for providing related words according to claim 10, characterized in that the apparatus for providing related words further comprises a normalization module:
    the normalization module is configured to subtract the screening threshold from the relevance between each lower-level related word in the lower-level related word set of the keyword and the keyword, and to subtract the screening threshold from the relevance between each upper-level related word in the upper-level related word set of the keyword and the keyword, thereby completing the normalization of the relevance.
PCT/CN2016/113175 2016-06-17 2016-12-29 Method and device for providing relevant words WO2017215244A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610445489.2 2016-06-17
CN201610445489.2A CN106126588B (en) 2016-06-17 2016-06-17 The method and apparatus of related term are provided

Publications (1)

Publication Number Publication Date
WO2017215244A1 true WO2017215244A1 (en) 2017-12-21

Family

ID=57470913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/113175 WO2017215244A1 (en) 2016-06-17 2016-12-29 Method and device for providing relevant words

Country Status (2)

Country Link
CN (1) CN106126588B (en)
WO (1) WO2017215244A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126588B (en) * 2016-06-17 2019-09-20 广州视源电子科技股份有限公司 The method and apparatus of related term are provided
CN108304366B (en) * 2017-03-21 2020-04-03 腾讯科技(深圳)有限公司 Hypernym detection method and device
CN108628832B (en) * 2018-05-08 2022-03-18 中国联合网络通信集团有限公司 Method and device for acquiring information keywords
CN109241525B (en) * 2018-08-20 2022-05-06 深圳追一科技有限公司 Keyword extraction method, device and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251844A (en) * 2007-02-21 2008-08-27 富士胶片株式会社 Apparatus and method for retrieval of contents
US20120072443A1 (en) * 2010-09-21 2012-03-22 Inventec Corporation Data searching system and method for generating derivative keywords according to input keywords
CN103778262A (en) * 2014-03-06 2014-05-07 北京林业大学 Information retrieval method and device based on thesaurus
CN104008097A (en) * 2013-02-21 2014-08-27 日电(中国)有限公司 Method and device for achieving query understanding
CN106126588A (en) * 2016-06-17 2016-11-16 广州视源电子科技股份有限公司 The method and apparatus that related term is provided

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810274B (en) * 2014-02-12 2017-03-29 北京联合大学 Multi-characteristic image tag sorting method based on WordNet semantic similarities
CN104123351B (en) * 2014-07-09 2017-08-25 百度在线网络技术(北京)有限公司 Interactive method and device

Also Published As

Publication number Publication date
CN106126588B (en) 2019-09-20
CN106126588A (en) 2016-11-16

Similar Documents

Publication Publication Date Title
US11263262B2 (en) Indexing a dataset based on dataset tags and an ontology
US8892420B2 (en) Text segmentation with multiple granularity levels
JP7211045B2 (en) Abstract generation method, abstract generation program, and abstract generation device
WO2017215244A1 (en) Method and device for providing relevant words
WO2021098648A1 (en) Text recommendation method, apparatus and device, and medium
CN107463548B (en) Phrase mining method and device
RU2547213C2 (en) Assigning actionable attributes to data describing personal identity
CN102662936B (en) Chinese-English unknown word translation method combining Web mining, multiple features and supervised learning
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
CN103678684A (en) Chinese word segmentation method based on navigation information retrieval
WO2017161899A1 (en) Text processing method, device, and computing apparatus
WO2017113592A1 (en) Model generation method, word weighting method, apparatus, device and computer storage medium
WO2018090468A1 (en) Method and device for searching for video program
CN109101551B (en) Question-answer knowledge base construction method and device
CN103313248A (en) Method and device for identifying junk information
CN110276079B (en) Word stock establishment method, information retrieval method and corresponding system
KR101811565B1 (en) System for providing an expert answer to a natural language question
CN111324705A (en) System and method for adaptively adjusting related search terms
CN110929509B (en) Domain event trigger word clustering method based on Louvain community discovery algorithm
WO2024078141A1 (en) Subject-based document retrieval prediction method
US10229105B1 (en) Mobile log data parsing
US20190213486A1 (en) Virtual Adaptive Learning of Financial Articles Utilizing Artificial Intelligence
JP2012141681A (en) Query segment position determining device
KR100659370B1 (en) Method for constructing a document database and method for searching information by matching thesaurus
CN113779987A (en) Event co-reference disambiguation method and system based on self-attention enhanced semantics

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 16905351; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 16905351; Country of ref document: EP; Kind code of ref document: A1