WO2017096819A1 - 一种同义词数据挖掘方法和系统 (A synonym data mining method and system) - Google Patents


Info

Publication number
WO2017096819A1
WO2017096819A1 (application PCT/CN2016/088681)
Authority
WO
WIPO (PCT)
Prior art keywords
synonym
vocabulary
similarity
pair
candidate
Prior art date
Application number
PCT/CN2016/088681
Other languages
English (en)
French (fr)
Inventor
李建南
Original Assignee
乐视控股(北京)有限公司
乐视网信息技术(北京)股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司 and 乐视网信息技术(北京)股份有限公司
Priority to US15/242,271 priority Critical patent/US20170169012A1/en
Publication of WO2017096819A1 publication Critical patent/WO2017096819A1/zh


Classifications

    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/334 Query execution
    • G06F16/35 Clustering; Classification (unstructured textual data)
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/951 Indexing; Web crawling techniques
    • G06N20/00 Machine learning
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence

Definitions

  • The invention relates to the field of media communication technology, and in particular to a synonym data mining method and system.
  • Data mining is generally the process of automatically searching large amounts of data for hidden information with special relationships. It is usually related to computer science and is implemented through statistics, online analytical processing, information retrieval, machine learning, expert systems, pattern recognition, and many other methods.
  • In a web retrieval application, a keyword can be input and all related content retrieved according to that keyword. However, existing applications can only retrieve content containing the identical keyword, so the retrieval scope is narrow and cannot satisfy users' retrieval needs. Moreover, if the input keyword is inaccurate, the target content may not be retrieved at all, and determining a suitable keyword costs the user a large amount of time, making the experience very poor. A synonym dictionary is therefore urgently needed so that web retrieval applications can retrieve more content.
  • the purpose of the embodiments of the present invention is to provide a synonym data mining method and system, which solves the problem that the network search application can only retrieve content with the same keyword in the prior art.
  • The lexical pairs in the dictionary and their similarity values are obtained by encoding all words in the dictionary, using the words appearing in each word's definition as its preliminary synonym vector, arranging the entries in a tree structure with the word as the parent node and its preliminary synonym vector as child nodes, and computing the similarity between each word and each corresponding preliminary synonym vector with the vector cosine-similarity algorithm;
  • The lexical pairs in the video file library and their similarity values are obtained by extracting video titles from a preset video file library and adding words appearing in the same title to each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (the formula itself appears only as an image in this record), where count(w1) is the number of titles in which w1 appears, count(w2) the number of titles in which w2 appears, and count(w1, w2) the number of titles in which w1 and w2 appear together;
  • In the search log records, words appearing in the same query and words appearing in different queries but with the same search result are each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (formula likewise given only as an image), where count(w1) is the number of queries in which w1 appears, count(w2) the number of queries in which w2 appears, count(w1, w2) the number of queries in which w1 and w2 appear together, and same(w1, w2) the number of cases where w1 and w2 appear in different queries but retrieve the same result.
  • Before establishing the candidate synonym database associating lexical pairs with similarity values, the method further comprises: summing and averaging each pair's similarity values from the dictionary, the video file library, and the search log records, and storing the average in the candidate thesaurus;
  • The candidate synonym database is represented as (w1, w2, T1, T2, T3, T), where T1 is the pair's similarity value in the dictionary, T2 its similarity value in the video file library, T3 its similarity value in the search log records, and T the average of the pair's similarity values.
  • Training and obtaining the synonym model comprises: extracting the 1st through nth data records (w1, w2, T) from the candidate synonym database as input and the (n+1)th through 2nth records (w1, w2, T) as output, and training a gradient boosting decision tree (GBDT) model: F(T) = α1β1(T) + α2β2(T) + … + αmβm(T), where β1–βm are the m decision trees, α1–αm are the weights of the trees, and T is the average of the three similarity values corresponding to each lexical pair.
  • Substituting the similarity value of each lexical pair in the candidate database into the synonym model means substituting the pair's average similarity into the synonym GBDT model and obtaining the model's output value.
  • In another aspect, an embodiment of the present invention further provides a synonym data mining system, including:
  • a candidate synonym database establishing unit configured to acquire the lexical pairs in a dictionary, a video file library, and search log records, together with each pair's similarity value, and to establish a candidate synonym database associating pairs with similarity values;
  • a synonym model establishing unit for training and obtaining a synonym model from the data in the candidate synonym database;
  • a thesaurus establishing unit configured to substitute the similarity value corresponding to each lexical pair in the candidate synonym database into the synonym model to obtain an output value, and to store pairs whose output value exceeds a preset threshold in the thesaurus.
  • The candidate synonym database establishing unit obtains the lexical pairs in the dictionary and their similarity values by encoding all words in the dictionary, using the words appearing in each word's definition as its preliminary synonym vector, arranging the entries in a tree structure with the word as the parent node and its preliminary synonym vector as child nodes, and computing the similarity between each word and each corresponding preliminary synonym vector with the vector cosine-similarity algorithm;
  • It obtains the lexical pairs in the video file library and their similarity values by extracting video titles from a preset video file library and adding words appearing in the same title to each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (formula given only as an image), where count(w1) is the number of titles in which w1 appears, count(w2) the number of titles in which w2 appears, and count(w1, w2) the number of titles in which w1 and w2 appear together;
  • In the search log records, words appearing in the same query and words appearing in different queries but with the same search result are each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (formula given only as an image), where count(w1) is the number of queries in which w1 appears, count(w2) the number of queries in which w2 appears, count(w1, w2) the number of queries in which w1 and w2 appear together, and same(w1, w2) the number of cases where w1 and w2 appear in different queries but retrieve the same result.
  • The candidate synonym database establishing unit is further configured to average each pair's similarity values from the dictionary, the video file library, and the search log records and to store the average in the candidate synonym database;
  • The candidate synonym database is represented as (w1, w2, T1, T2, T3, T), where T1 is the pair's similarity value in the dictionary, T2 its similarity value in the video file library, T3 its similarity value in the search log records, and T the average of the pair's similarity values.
  • The synonym model establishing unit trains and obtains the synonym model by extracting the 1st through nth data records (w1, w2, T) from the candidate synonym database as input and the (n+1)th through 2nth records (w1, w2, T) as output, and training a gradient boosting decision tree (GBDT) model: F(T) = α1β1(T) + α2β2(T) + … + αmβm(T), where β1–βm are the m decision trees, α1–αm are the weights of the trees, and T is the average of the three similarity values corresponding to each lexical pair.
  • The thesaurus establishing unit substitutes the similarity value of each lexical pair in the candidate synonym database into the synonym model by substituting the pair's corresponding similarity value into the synonym GBDT model and obtaining the model's output value.
  • the embodiment of the present application further provides a computer storage medium, where the computer storage medium can store a program, and when the program is executed, some or all of the steps of the foregoing synonym data mining methods can be implemented.
  • The synonym data mining method and system establish a candidate synonym database associating lexical pairs with similarity values by acquiring the pairs and their similarity values from a dictionary, a video file library, and search log records; train and obtain a synonym model from the data in the candidate database; substitute the similarity value corresponding to each pair into the model to obtain an output value; and store pairs whose output value exceeds a preset threshold in the thesaurus.
  • FIG. 1 is a schematic flow chart of a synonym data mining method in a first embodiment of the present invention
  • FIG. 2 is a schematic flow chart of a synonym data mining method according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a synonym data mining system according to the present invention.
  • The present invention recognizes, from the user's point of view, that users desire to retrieve more content in a web search application. The idea of the present invention is therefore to provide a synonym retrieval function in such an application.
  • FIG. 1 is a schematic flowchart of a synonym data mining method according to a first embodiment of the present invention, where the synonym data mining method includes:
  • Step 101: Acquire the lexical pairs in the dictionary, the video file library, and the search log records, together with each pair's similarity value, and establish a candidate synonym database associating pairs with similarity values.
  • First, a preliminary synonym database is established based on the dictionary, and the associated lexical pairs and their similarity values are stored in the dictionary preliminary thesaurus.
  • Specifically, all words in the dictionary are encoded and the words appearing in each word's definition are used as its preliminary synonym vector. The entries are then arranged in a tree structure, with the word as the parent node and its preliminary synonym vector as child nodes.
  • The vector cosine-similarity algorithm is then used to calculate the similarity between each word and each corresponding preliminary synonym vector.
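The dictionary step above can be sketched in Python. The dictionary entries, gloss tokens, and sparse-count representation below are illustrative assumptions, not the patent's actual data; only the cosine-similarity computation itself follows the text.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse term-count vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical dictionary: headword -> tokens of its definition. The words
# in a definition form the headword's preliminary synonym vector
# (parent node -> child nodes in the patent's tree arrangement).
dictionary = {
    "film":  ["motion", "picture", "story", "screen"],
    "movie": ["motion", "picture", "shown", "screen"],
}
vectors = {w: Counter(gloss) for w, gloss in dictionary.items()}
sim = cosine_similarity(vectors["film"], vectors["movie"])
```

Each word's similarity to every word in its preliminary synonym vector would be computed this way, producing the dictionary-based lexical pairs and T1 values.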
  • Next, a preliminary synonym database is established based on the video files, and the associated lexical pairs and their similarity values are stored in the video-file preliminary thesaurus.
  • Specifically, video titles are extracted from a preset video file library and words appearing in the same title are added to each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (formula given only as an image in this record), where count(w1) is the number of titles in which w1 appears, count(w2) the number of titles in which w2 appears, and count(w1, w2) the number of titles in which w1 and w2 appear together.
  • A preliminary synonym database is also created based on the search logs, and the associated lexical pairs and their similarity values are stored in the search-log preliminary thesaurus.
  • Specifically, words appearing in the same query and words appearing in different queries but with the same search result are each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (formula given only as an image), where count(w1) is the number of queries in which w1 appears, count(w2) the number of queries in which w2 appears, count(w1, w2) the number of queries in which w1 and w2 appear together, and same(w1, w2) the number of cases where w1 and w2 appear in different queries but retrieve the same result.
  • All lexical pairs having a preliminary synonym relationship in the dictionary, video-file, and search-log preliminary thesauruses are then obtained.
  • The similarity value corresponding to each pair in each of the three preliminary thesauruses is extracted, and a candidate thesaurus is created.
  • For each pair, the similarity values from the dictionary, the video file library, and the search log records are summed and averaged, and the result is stored in the candidate thesaurus. The candidate synonym database is therefore represented as (w1, w2, T1, T2, T3, T), where T1 is the pair's similarity value in the dictionary, T2 its similarity value in the video file library, T3 its similarity value in the search log records, and T the average of the pair's similarity values.
  • Step 102: Train and obtain a synonym model from the data in the candidate synonym database.
  • The 1st through nth data records (w1, w2, T) are extracted from the candidate synonym database as input, and the (n+1)th through 2nth records (w1, w2, T) as output, to train the gradient boosting decision tree model:
  • F(T) = α1β1(T) + α2β2(T) + … + αmβm(T)
  • where β1–βm are the m decision trees, α1–αm are the weights of the trees, and T is the average of the three similarity values corresponding to each lexical pair.
  • Step 103: Substitute the similarity value corresponding to each lexical pair in the candidate thesaurus into the synonym model and determine whether the output value exceeds a preset threshold. If it does, the corresponding pair is extracted from the candidate synonym database and stored in the thesaurus; if not, the pair is discarded.
  • Specifically, the average similarity of each pair in the candidate synonym database is substituted into the synonym gradient boosting decision tree model, and the model's output is obtained.
  • The resulting thesaurus can be used in retrieval applications.
  • By acquiring the keyword input by the user, the synonyms corresponding to the keyword can be found in the thesaurus, and information related to both the keyword and its synonyms can then be searched.
  • When the thesaurus is applied to a search application, the user entering a keyword can choose whether to also search its synonyms: if yes, information related to the keyword and its synonyms is retrieved; if no, only information related to the keyword itself is retrieved. The present invention thus not only establishes a highly accurate synonym library for search applications but, more importantly, lets the user decide whether to perform synonym retrieval.
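The retrieval behavior described above can be sketched as a small query-expansion helper. The function name, thesaurus contents, and toggle parameter are hypothetical illustrations of the user-selectable synonym search, not an API defined by the patent.

```python
def expand_query(keyword, thesaurus, use_synonyms=True):
    """Return the set of search terms for a keyword. The thesaurus maps a
    word to its mined synonyms; the caller toggles synonym retrieval."""
    terms = {keyword}
    if use_synonyms:
        terms |= set(thesaurus.get(keyword, ()))
    return terms

# Hypothetical mined thesaurus.
thesaurus = {"film": ["movie", "picture"]}
with_syn = expand_query("film", thesaurus)            # search keyword + synonyms
without_syn = expand_query("film", thesaurus, False)  # search keyword only
```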
  • In another embodiment, the synonym data mining method may specifically adopt the following steps:
  • Step 201: Establish corresponding preliminary synonym databases based on the dictionary, the video file library, and the search log records.
  • When the dictionary-based preliminary synonym database is created, all words are encoded and the words appearing in each word's definition are used as its preliminary synonym vector, arranged in a tree structure: the word is the parent node and its preliminary synonym vector the child nodes. Finally, the vector cosine-similarity algorithm is used to calculate the similarity between each word and its corresponding preliminary synonym vectors.
  • For the video-file-based database, video titles are extracted from a preset video file library, and words appearing in the same title are added to each other's preliminary synonym vectors.
  • The following method is used: for a word w1 and its candidate synonym w2, the number of titles containing w1 is counted as count(w1), the number of titles containing w2 as count(w2), and the number of titles containing both w1 and w2 as count(w1, w2); the similarity of w1 and w2 is then calculated (formula given only as an image in this record).
  • When the preliminary synonym database is built from the search logs, the user search logs are used: for two words w1 and w2, the number of queries containing w1 is counted as count(w1) and the number containing w2 as count(w2). The number of queries containing both is counted as count(w1, w2), and w1 and w2 become each other's preliminary synonym vectors. In addition, the number of cases where w1 and w2 appear in different queries but retrieve the same result is recorded as same(w1, w2). The similarity of w1 and w2 is then calculated (formula given only as an image).
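The title-counting step can be sketched as follows. Since the patent's similarity formula survives only as an unreproduced image, a Dice-style ratio 2·count(w1, w2) / (count(w1) + count(w2)) is assumed here purely for illustration; the titles are invented.

```python
from collections import Counter
from itertools import combinations

def title_similarities(titles):
    """Count per-word title frequency and pairwise title co-occurrence,
    then score each pair with an assumed Dice-style ratio."""
    word_count = Counter()   # count(w): number of titles containing w
    pair_count = Counter()   # count(w1, w2): titles containing both
    for title in titles:
        words = set(title.split())
        word_count.update(words)
        for w1, w2 in combinations(sorted(words), 2):
            pair_count[(w1, w2)] += 1
    return {
        (w1, w2): 2 * c / (word_count[w1] + word_count[w2])
        for (w1, w2), c in pair_count.items()
    }

titles = ["space battle epic", "space battle saga", "epic saga"]
sims = title_similarities(titles)
```

The search-log variant would add a same(w1, w2) count for pairs that appear in different queries yet retrieve the same result, per the text above.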
  • Step 202: Acquire all lexical pairs in the dictionary, video-file, and search-log preliminary thesauruses that have a preliminary synonym relationship in common.
  • Step 203: Extract the similarity value corresponding to each pair from the dictionary, video-file, and search-log preliminary thesauruses.
  • Step 204: Sum the three similarity values corresponding to each pair and divide to obtain the average value T.
  • Step 205: Establish the candidate synonym database.
  • Each lexical pair is stored in the candidate synonym database together with its corresponding similarity values from the dictionary, video-file, and search-log preliminary thesauruses, i.e. the similarities of the three vectors.
  • The candidate synonym database is represented as (w1, w2, T1, T2, T3), where w1 and w2 are words with a preliminary synonym relationship, T1 is the similarity from the dictionary preliminary synonym vectors, T2 the similarity from the video-file preliminary synonym vectors, and T3 the similarity from the search-log preliminary synonym vectors.
  • Step 206: Extract the 1st through nth data records (w1, w2, T) from the candidate synonym database as input and the (n+1)th through 2nth records (w1, w2, T) as output, and train the gradient boosting decision tree (GBDT) model.
  • Step 207: Obtain the synonym GBDT model:
  • F(T) = α1β1(T) + α2β2(T) + … + αmβm(T)
  • where β1–βm are the m decision trees, α1–αm are the weights of the trees, and T is the average of the three similarity values corresponding to each lexical pair.
  • Step 208: Substitute the average T of the three similarity values corresponding to each pair in the candidate synonym database into the synonym GBDT model to obtain the output value.
  • Step 209: Determine whether the output value is greater than a preset threshold. If yes, proceed to step 210; if not, proceed to step 211.
  • Step 210: Extract the lexical pair corresponding to the output value from the candidate synonym database and store it in the synonym database.
  • Step 211: Discard the lexical pair corresponding to the result.
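Steps 206 through 211 can be sketched as follows. The patent does not give the training algorithm's internals, so this uses hand-rolled depth-1 regression trees (stumps) fit to residuals with one fixed weight α for every tree, which is a simplification of a full GBDT (F(T) = Σ αᵢβᵢ(T)); the training data, threshold, and weight are hypothetical.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree (stump) on scalar inputs by brute force."""
    best = None
    for thr in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= thr else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    if best is None:  # all inputs equal: constant predictor
        m = sum(ys) / len(ys)
        return lambda x: m
    _, thr, lm, rm = best
    return lambda x, thr=thr, lm=lm, rm=rm: lm if x <= thr else rm

def fit_gbdt(xs, ys, m=10, alpha=0.5):
    """F(T) = sum_i alpha * beta_i(T): m stumps, each fit to the residuals
    of the running prediction, all sharing one weight alpha."""
    trees, preds = [], [0.0] * len(xs)
    for _ in range(m):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + alpha * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(alpha * t(x) for t in trees)

# Hypothetical training data: average similarity T -> target synonym score.
xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = fit_gbdt(xs, ys)

# Steps 208-211: keep pairs whose model output exceeds the preset threshold.
threshold = 0.5
thesaurus = [x for x in xs if model(x) > threshold]
```

In practice a production GBDT library would replace `fit_gbdt`, but the thresholding in the last two lines mirrors steps 209 and 210 directly.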
  • The synonym data mining system includes a candidate synonym database establishing unit 301, a synonym model establishing unit 302, and a thesaurus establishing unit 303.
  • The candidate synonym database establishing unit 301 is configured to acquire the lexical pairs in the dictionary, the video file library, and the search log records, together with each pair's similarity value, and to establish a candidate synonym database associating pairs with similarity values.
  • The synonym model establishing unit 302 is used to train and obtain a synonym model from the data in the candidate synonym database.
  • The thesaurus establishing unit 303 is configured to substitute the similarity value corresponding to each lexical pair in the candidate synonym database into the synonym model to obtain an output value, and to store pairs whose output value exceeds a preset threshold in the thesaurus.
  • The candidate synonym database establishing unit 301 establishes a preliminary synonym database based on the dictionary and stores the associated lexical pairs and their similarity values in the dictionary preliminary thesaurus.
  • Specifically, all words in the dictionary are encoded and the words appearing in each word's definition are used as its preliminary synonym vector. The entries are then arranged in a tree structure, with the word as the parent node and its preliminary synonym vector as child nodes.
  • The vector cosine-similarity algorithm is then used to calculate the similarity between each word and each corresponding preliminary synonym vector.
  • A preliminary synonym database is likewise created based on the video files, and the associated lexical pairs and their similarity values are stored in the video-file preliminary thesaurus.
  • Specifically, video titles are extracted from a preset video file library and words appearing in the same title are added to each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (formula given only as an image), where count(w1) is the number of titles in which w1 appears, count(w2) the number of titles in which w2 appears, and count(w1, w2) the number of titles in which w1 and w2 appear together.
  • A preliminary synonym database is also established based on the search logs, and the associated lexical pairs and their similarity values are stored in the search-log preliminary thesaurus. Specifically, words appearing in the same query and words appearing in different queries but with the same search result are each other's preliminary synonym vectors. For a word w1 and its candidate synonym w2, the similarity between the word and each corresponding preliminary synonym vector is computed (formula given only as an image), where count(w1) is the number of queries in which w1 appears, count(w2) the number of queries in which w2 appears, count(w1, w2) the number of queries in which w1 and w2 appear together, and same(w1, w2) the number of cases where w1 and w2 appear in different queries but retrieve the same result.
  • The candidate synonym database establishing unit 301 then acquires all lexical pairs having a preliminary synonym relationship in the dictionary, video-file, and search-log preliminary thesauruses, extracts the similarity value corresponding to each pair from the three preliminary thesauruses, and creates the candidate thesaurus.
  • Unit 301 sums and averages each pair's similarity values from the dictionary, the video file library, and the search log records and stores them in the candidate synonym database. The candidate synonym database is therefore represented as (w1, w2, T1, T2, T3, T), where T1 is the pair's similarity value in the dictionary, T2 its similarity value in the video file library, T3 its similarity value in the search log records, and T the average of the pair's similarity values.
  • The synonym model establishing unit 302 extracts the 1st through nth data records (w1, w2, T) from the candidate synonym database as input and the (n+1)th through 2nth records as output, and trains the gradient boosting decision tree model, obtaining F(T) = α1β1(T) + α2β2(T) + … + αmβm(T), where β1–βm are the m decision trees, α1–αm are the weights of the trees, and T is the average of the three similarity values corresponding to each lexical pair.
  • The thesaurus establishing unit 303 substitutes the similarity value corresponding to each lexical pair in the candidate synonym database into the synonym gradient boosting decision tree model to obtain the model's output.
  • In summary, the synonym data mining method and system creatively provide a way to establish a synonym database, and the synonyms in it are obtained through multiple layers of screening and calculation, yielding highly accurate synonym pairs. The thesaurus can be applied to search applications, satisfying both the user's need to retrieve more content and user-defined search scope (whether synonym results are included), so the invention has broad and significant applicability. Finally, the entire synonym data mining method and system are concise and straightforward to implement.
  • the embodiment of the present application further provides a computer storage medium, where the computer storage medium can store a program, and when the program is executed, some or all of the steps of the foregoing synonym data mining methods can be implemented.


Abstract

A synonym data mining method and system. The method comprises: acquiring lexical pairs and their similarity values from a dictionary, a video file library, and search log records, and establishing a candidate synonym database (101); training and obtaining a synonym model from the data in the candidate synonym database (102); and substituting the similarity values into the synonym model, determining whether the result exceeds a threshold, and storing the corresponding lexical pair in the thesaurus or discarding it (103). The method can build a highly accurate synonym library applicable to retrieval applications, enabling users to retrieve more content and improving retrieval quality.

Description

Synonym Data Mining Method and System
This application claims priority to Chinese Patent Application No. 201510908015.2, filed with the Chinese Patent Office on December 9, 2015 and entitled "Synonym Data Mining Method and System", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of media communication technology, and in particular to a synonym data mining method and system.
Background
With the rapid development of network technology, people's demand for networks is reflected in every corner of life and has begun to exert a profound influence on society. Data mining is generally the process of automatically searching large amounts of data for hidden information with special relationships; it is usually associated with computer science and is realized through methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems and pattern recognition.
At present, network retrieval applications that combine data mining with network technology allow a keyword to be entered and all content related to that keyword to be retrieved. In the prior art, however, such applications can only retrieve content containing the exact same keyword, so the retrieval scope is very narrow and cannot satisfy users' retrieval needs. Moreover, if the entered keyword is inaccurate, the target content may not be retrieved at all, and existing retrieval applications require a great deal of time to be spent on choosing keywords, resulting in a very poor user experience. A synonym dictionary database that enables more content to be retrieved is therefore urgently needed in current network retrieval applications.
Summary
In view of this, an object of the embodiments of the present invention is to provide a synonym data mining method and system, solving the prior-art problem that network retrieval applications can only retrieve content containing the exact same keyword.
Based on the above object, the synonym data mining method provided by the embodiments of the present invention comprises the steps of:
acquiring vocabulary pairs, and the similarity values of those pairs, from a dictionary, a video file library and search log records, and establishing a candidate synonym database in which the vocabulary pairs are associated with their similarity values;
training and obtaining a synonym model according to the data information in the candidate synonym database;
substituting the similarity value corresponding to each vocabulary pair in the candidate synonym database into the synonym model to obtain an output value, and storing in a synonym database those vocabulary pairs whose output value is greater than a preset threshold.
In some embodiments, the vocabulary pairs in the dictionary and their similarity values are obtained by encoding all vocabulary in the dictionary, taking the words appearing in each word's definition as its preliminary synonym vector, arranging them in a tree structure with the word as the parent node and its preliminary synonym vector as child nodes, and then calculating the similarity between each word and each of its corresponding preliminary synonym vectors using the vector cosine similarity algorithm;
for the vocabulary pairs in the video file library and their similarity values, video titles are extracted from a preset video file library, and words appearing in the same title are added to each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000001
where count(w1) is the number of titles in which w1 appears, count(w2) is the number of titles in which w2 appears, and count(w1,w2) is the number of titles in which w1 and w2 appear together;
in the search log records, words appearing in the same query request, and words appearing in different query requests but yielding the same search result, serve as each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000002
where count(w1) is the number of queries in which w1 appears, count(w2) is the number of queries in which w2 appears, count(w1,w2) is the number of queries in which w1 and w2 appear together, and same(w1,w2) is the number of times w1 and w2 appeared in different queries but retrieved the same result.
In some embodiments, before the candidate synonym database associating vocabulary pairs with similarity values is established, the method further comprises: averaging each vocabulary pair's similarity values from the dictionary, the video file library and the search log records, and storing the average in the candidate synonym database;
furthermore, the candidate synonym database is represented as (w1,w2,T1,T2,T3,T), where T1 is the similarity value of the pair w1, w2 in the dictionary, T2 is its similarity value in the video file library, T3 is its similarity value in the search log records, and T is the average similarity of the pair w1, w2.
In some embodiments, training and obtaining the synonym model comprises: extracting the 1st to nth data records (w1,w2,T) from the candidate synonym database as input and the (n+1)th to 2nth records (w1,w2,T) as output, and training a gradient boosting decision tree model;
obtaining the synonym gradient boosting decision tree model: F(T)=α1β1(T)+α2β2(T)+...+αmβm(T)
where β1 to βm are m decision trees, α1 to αm are the weights of the trees, and T is the average of the three similarity values corresponding to each vocabulary pair.
In some embodiments, substituting the similarity value corresponding to each vocabulary pair in the candidate synonym database into the synonym model means substituting the average similarity corresponding to each pair into the synonym gradient boosting decision tree model to obtain the output value of that model.
In another aspect, an embodiment of the present invention further provides a synonym data mining system, comprising:
a candidate synonym database establishing unit, configured to acquire vocabulary pairs and their similarity values from a dictionary, a video file library and search log records, and to establish a candidate synonym database associating the vocabulary pairs with their similarity values;
a synonym model establishing unit, configured to train and obtain a synonym model according to the data information in the candidate synonym database;
a thesaurus establishing unit, configured to substitute the similarity value corresponding to each vocabulary pair in the candidate synonym database into the synonym model to obtain an output value, and to store in a synonym database those vocabulary pairs whose output value is greater than a preset threshold.
In some embodiments, the candidate synonym database establishing unit obtains the vocabulary pairs in the dictionary and their similarity values by encoding all vocabulary in the dictionary, taking the words appearing in each word's definition as its preliminary synonym vector, arranging them in a tree structure with the word as the parent node and its preliminary synonym vector as child nodes, and then calculating the similarity between each word and each of its corresponding preliminary synonym vectors using the vector cosine similarity algorithm;
for the vocabulary pairs in the video file library and their similarity values, video titles are extracted from a preset video file library, and words appearing in the same title are added to each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000003
where count(w1) is the number of titles in which w1 appears, count(w2) is the number of titles in which w2 appears, and count(w1,w2) is the number of titles in which w1 and w2 appear together;
in the search log records, words appearing in the same query request, and words appearing in different query requests but yielding the same search result, serve as each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000004
where count(w1) is the number of queries in which w1 appears, count(w2) is the number of queries in which w2 appears, count(w1,w2) is the number of queries in which w1 and w2 appear together, and same(w1,w2) is the number of times w1 and w2 appeared in different queries but retrieved the same result.
In some embodiments, the candidate synonym database establishing unit is further configured to average each vocabulary pair's similarity values from the dictionary, the video file library and the search log records, and to store the average in the candidate synonym database;
furthermore, the candidate synonym database is represented as (w1,w2,T1,T2,T3,T), where T1 is the similarity value of the pair w1, w2 in the dictionary, T2 is its similarity value in the video file library, T3 is its similarity value in the search log records, and T is the average similarity of the pair w1, w2.
In some embodiments, the synonym model establishing unit trains and obtains the synonym model by: extracting the 1st to nth data records (w1,w2,T) from the candidate synonym database as input and the (n+1)th to 2nth records (w1,w2,T) as output, and training a gradient boosting decision tree model;
obtaining the synonym gradient boosting decision tree model: F(T)=α1β1(T)+α2β2(T)+...+αmβm(T)
where β1 to βm are m decision trees, α1 to αm are the weights of the trees, and T is the average of the three similarity values corresponding to each vocabulary pair.
In some embodiments, the thesaurus establishing unit substitutes the similarity value corresponding to each vocabulary pair in the candidate synonym database into the synonym model by substituting the average similarity corresponding to each pair into the synonym gradient boosting decision tree model, obtaining the output value of that model.
In yet another aspect, the embodiments of the present application further provide a computer storage medium that can store a program; when the program is executed, some or all of the steps of the various implementations of the foregoing synonym data mining method can be implemented.
As can be seen from the above, the synonym data mining method and system provided by the embodiments of the present invention acquire vocabulary pairs and their similarity values from a dictionary, a video file library and search log records; establish a candidate synonym database associating the vocabulary pairs with their similarity values; train and obtain a synonym model according to the data information in the candidate synonym database; substitute the similarity value corresponding to each vocabulary pair into the synonym model to obtain an output value; and store in a synonym database those pairs whose output value is greater than a preset threshold. A highly accurate synonym database can thereby be established and applied in retrieval applications, so that users can retrieve more content and retrieval quality is improved.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the synonym data mining method in a first embodiment of the present invention;
FIG. 2 is a schematic flowchart of the synonym data mining method in a reference embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the synonym data mining system of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Given the current state of network retrieval applications, users cannot retrieve more content than what matches their keywords; the information each user can find through a retrieval application is therefore very limited, being only content that contains the same keyword. To solve this problem, the present invention takes the user's perspective and recognizes that users want to be able to retrieve more content in network retrieval applications. The idea of the present invention is therefore to provide a synonym retrieval function in network retrieval applications.
Referring to FIG. 1, which is a schematic flowchart of the synonym data mining method in the first embodiment of the present invention, the synonym data mining method comprises:
Step 101: acquiring vocabulary pairs and their similarity values from a dictionary, a video file library and search log records, and establishing a candidate synonym database associating the vocabulary pairs with their similarity values.
Preferably, a preliminary synonym database is built from the dictionary, in which related vocabulary pairs and their similarity values are stored. Specifically, all vocabulary in the dictionary is encoded, and the words appearing in each word's definition are taken as its preliminary synonym vector. These are then arranged in a tree structure, with the word as the parent node and its preliminary synonym vector as child nodes. The similarity between each word and each of its corresponding preliminary synonym vectors is then calculated using the vector cosine similarity algorithm.
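The dictionary-based screening above can be sketched as follows. This is a minimal illustration only: the patent does not fix a word-encoding or tokenization scheme, so simple whitespace tokenization and bag-of-words definition vectors are assumptions, and all names are illustrative:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    # Cosine similarity between two sparse bag-of-words vectors
    # (Counters); missing terms count as zero.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dictionary_candidates(dictionary):
    # 'dictionary' maps a headword (parent node) to its definition text.
    # Every word appearing in the definition is a candidate synonym
    # (child node); the pair's similarity is the cosine of the two
    # definition vectors.
    vectors = {w: Counter(text.split()) for w, text in dictionary.items()}
    pairs = []
    for parent, text in dictionary.items():
        for child in sorted(set(text.split())):
            if child != parent and child in vectors:
                sim = cosine_similarity(vectors[parent], vectors[child])
                pairs.append((parent, child, sim))
    return pairs
```

Each headword acts as the parent node and the words of its definition as child nodes; only children that are themselves headwords can be scored, since the cosine is taken over definition vectors.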
Preferably, a preliminary synonym database is built from the video files, in which related vocabulary pairs and their similarity values are stored. Specifically, video titles are extracted from a preset video file library, and words appearing in the same title are added to each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000005
where count(w1) is the number of titles in which w1 appears, count(w2) is the number of titles in which w2 appears, and count(w1,w2) is the number of titles in which w1 and w2 appear together.
In another preferred embodiment, a preliminary synonym database is built from the search logs, in which related vocabulary pairs and their similarity values are stored. Specifically, words appearing in the same query request, and words appearing in different query requests but yielding the same search result, serve as each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000006
where count(w1) is the number of queries in which w1 appears, count(w2) is the number of queries in which w2 appears, count(w1,w2) is the number of queries in which w1 and w2 appear together, and same(w1,w2) is the number of times w1 and w2 appeared in different queries but retrieved the same result.
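The two count-based similarities can be sketched as below. The patent's exact formulas appear only as images that are not reproduced in this text, so a Dice-style coefficient over the stated counts is assumed here; the true formulas may differ:

```python
def title_similarity(count_w1, count_w2, count_both):
    # Similarity of w1, w2 over video titles, built from count(w1),
    # count(w2) and count(w1, w2). The Dice-style form is an assumption.
    denom = count_w1 + count_w2
    return 2.0 * count_both / denom if denom else 0.0

def query_similarity(count_w1, count_w2, count_both, same_result):
    # Search-log similarity; same(w1, w2) additionally credits pairs
    # appearing in different queries that led to the same result.
    denom = count_w1 + count_w2
    return 2.0 * (count_both + same_result) / denom if denom else 0.0
```

The second function differs from the first only in crediting same(w1,w2), matching the description that the search-log source uses one more count than the title source.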
Preferably, all vocabulary pairs that share a preliminary synonym relationship in the dictionary, video file and search log preliminary synonym databases are acquired, and the similarity values of each pair in each of those three databases are extracted. The candidate synonym database is then established.
As another embodiment, each vocabulary pair's similarity values from the dictionary, the video file library and the search log records are averaged, and the average is stored in the candidate synonym database. The candidate synonym database is thus represented as (w1,w2,T1,T2,T3,T), where T1 is the similarity value of the pair w1, w2 in the dictionary, T2 is its similarity value in the video file library, T3 is its similarity value in the search log records, and T is the average similarity of the pair w1, w2.
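Assembling the candidate record (w1, w2, T1, T2, T3, T) is then a simple average of the three source similarities (illustrative sketch; the tuple layout follows the representation above):

```python
def candidate_entry(w1, w2, t1, t2, t3):
    # Candidate-thesaurus record (w1, w2, T1, T2, T3, T): T is the mean
    # of the dictionary, video-title and search-log similarities.
    t = (t1 + t2 + t3) / 3.0
    return (w1, w2, t1, t2, t3, t)
```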
Step 102: training and obtaining a synonym model according to the data information in the candidate synonym database.
Preferably, the 1st to nth data records (w1,w2,T) are extracted from the candidate synonym database as input and the (n+1)th to 2nth records (w1,w2,T) as output, and a gradient boosting decision tree model is trained, yielding the synonym gradient boosting decision tree model: F(T)=α1β1(T)+α2β2(T)+...+αmβm(T)
where β1 to βm are m decision trees, α1 to αm are the weights of the trees, and T is the average of the three similarity values corresponding to each vocabulary pair.
Step 103: substituting the similarity value corresponding to each vocabulary pair in the candidate synonym database into the synonym model, and determining whether the obtained output value is greater than a preset threshold; if it is greater, extracting the vocabulary pair corresponding to that output value from the candidate synonym database and storing it in the synonym database; if it is smaller, discarding the vocabulary pair corresponding to that result.
Preferably, the average similarity corresponding to each vocabulary pair in the candidate synonym database is substituted into the synonym gradient boosting decision tree model to obtain the output of that model.
It should be noted that the finally formed synonym database can be used in retrieval applications. In use, a keyword entered by the user is acquired, the synonyms corresponding to that keyword are looked up in the synonym database, and information related to both the keyword and its synonyms can then be retrieved. Notably, when the synonym database is applied in various search applications, a user entering a keyword can choose whether the keyword's synonyms should also be searched: if yes, information related to the keyword and its synonyms is retrieved; if no, only information related to the keyword itself is retrieved. The present invention therefore not only establishes a highly accurate synonym database and makes it available in retrieval applications, but, more importantly, lets users decide for themselves whether to perform synonym retrieval.
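The retrieval-time use of the thesaurus, with the user's opt-in/opt-out choice described above, might look like this (the inverted index and thesaurus structures are hypothetical, for illustration only):

```python
def search(query, index, synonyms, include_synonyms=True):
    # index: term -> set of document ids (a simple inverted index);
    # synonyms: term -> list of synonyms from the mined thesaurus.
    # The user may opt out of synonym expansion, in which case only
    # the literal keyword is searched.
    terms = [query]
    if include_synonyms:
        terms += synonyms.get(query, [])
    results = set()
    for term in terms:
        results |= index.get(term, set())
    return results
```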
As a reference embodiment, referring to FIG. 2, the synonym data mining method may specifically adopt the following steps:
Step 201: establishing corresponding preliminary synonym databases based on the dictionary, the video file library and the search log records respectively.
As an embodiment, when the preliminary synonym database is built from the dictionary, all vocabulary is encoded, and the words appearing in each word's definition can be taken as its preliminary synonym vector and arranged in a tree structure, i.e. with the word as the parent node and its preliminary synonym vector as child nodes. Finally, the similarity between each word and each of its corresponding preliminary synonym vectors is calculated using the vector cosine similarity algorithm.
When the preliminary synonym database is built from the video files, video titles are extracted from a preset video file library, and words appearing in the same title are added to each other's preliminary synonym vectors. Preferably, the similarity between each word and each of its corresponding preliminary synonym vectors is computed as follows: for a word w1 and its corresponding synonym w2, count the number of titles in which w1 appears, denoted count(w1); likewise count the number of titles in which w2 appears, denoted count(w2); denote the number of titles in which w1 and w2 appear together as count(w1,w2); and compute the similarity of w1 and w2:
Figure PCTCN2016088681-appb-000007
When the preliminary synonym database is built from the search logs, then, based on the user search logs, for two words w1 and w2, count the number of queries in which w1 appears, denoted count(w1); likewise count the number of queries in which w2 appears, denoted count(w2). The number of queries in which w1 and w2 appear together is denoted count(w1,w2); that is, w1 and w2 serve as each other's preliminary synonym vectors. In addition, the number of times w1 and w2 appeared in different queries but retrieved the same result is denoted same(w1,w2). Compute the similarity of w1 and w2:
Figure PCTCN2016088681-appb-000008
Step 202: acquiring all vocabulary pairs that share a preliminary synonym relationship in the dictionary, video file and search log preliminary synonym databases.
Step 203: extracting the similarity values of each vocabulary pair in the dictionary, video file and search log preliminary synonym databases respectively.
Step 204: averaging the similarity values of the three vectors corresponding to each vocabulary pair in the candidate synonym database to obtain the average T.
Step 205: establishing the candidate synonym database.
In this embodiment, the candidate synonym database stores vocabulary pairs, and for each pair it stores that pair's similarity values in the dictionary, video file and search log preliminary synonym databases, i.e. the similarities of the three vectors. In a specific implementation, the candidate synonym database is represented as (w1,w2,T1,T2,T3), where w1 and w2 are words with a preliminary synonym relationship, T1 is the dictionary preliminary-synonym vector similarity, T2 is the video file preliminary-synonym vector similarity, and T3 is the search log preliminary-synonym vector similarity.
Step 206: extracting the 1st to nth data records (w1,w2,T) from the candidate synonym database as input and the (n+1)th to 2nth records (w1,w2,T) as output, and training a gradient boosting decision tree (GBDT) model.
Step 207: obtaining the synonym gradient boosting decision tree (GBDT) model:
F(T)=α1β1(T)+α2β2(T)+...+αmβm(T)
where β1 to βm are m decision trees, α1 to αm are the weights of the trees, and T is the average of the three similarity values corresponding to each vocabulary pair.
Step 208: substituting the averaged similarity of the three vectors corresponding to each pair in the candidate synonym database into the synonym GBDT model to obtain an output value.
Step 209: determining whether the output value is greater than the preset threshold; if so, proceeding to step 210; if not, proceeding to step 211.
Step 210: extracting the vocabulary pair corresponding to the output value from the candidate synonym database and storing it in the synonym database.
Step 211: discarding the vocabulary pair corresponding to that result.
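Steps 208-211 amount to a simple threshold filter over the candidate pairs. A sketch, where `model` stands in for the trained GBDT and the data structures are illustrative:

```python
def build_thesaurus(candidates, model, threshold):
    # candidates: iterable of (w1, w2, T) records from the candidate
    # synonym database. Pairs whose model output exceeds the preset
    # threshold are kept (steps 208-210); the rest are discarded
    # (step 211).
    thesaurus = []
    for w1, w2, t in candidates:
        if model(t) > threshold:
            thesaurus.append((w1, w2))
    return thesaurus
```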
In another aspect of the embodiments of the present invention, a synonym data mining system is further provided. As shown in FIG. 3, the synonym data mining system comprises a candidate synonym database establishing unit 301, a synonym model establishing unit 302 and a thesaurus establishing unit 303, connected in sequence. The candidate synonym database establishing unit 301 is configured to acquire vocabulary pairs and their similarity values from a dictionary, a video file library and search log records, and to establish a candidate synonym database associating the vocabulary pairs with their similarity values. The synonym model establishing unit 302 is configured to train and obtain a synonym model according to the data information in the candidate synonym database. The thesaurus establishing unit 303 is configured to substitute the similarity value corresponding to each vocabulary pair in the candidate synonym database into the synonym model to obtain an output value, and to store in a synonym database those vocabulary pairs whose output value is greater than a preset threshold.
Optionally, the candidate synonym database establishing unit 301 builds a preliminary synonym database from the dictionary, in which related vocabulary pairs and their similarity values are stored. Specifically, all vocabulary in the dictionary is encoded, and the words appearing in each word's definition are taken as its preliminary synonym vector. These are then arranged in a tree structure, with the word as the parent node and its preliminary synonym vector as child nodes. The similarity between each word and each of its corresponding preliminary synonym vectors is then calculated using the vector cosine similarity algorithm.
A preliminary synonym database is built from the video files, in which related vocabulary pairs and their similarity values are stored. Specifically, video titles are extracted from a preset video file library, and words appearing in the same title are added to each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000009
where count(w1) is the number of titles in which w1 appears, count(w2) is the number of titles in which w2 appears, and count(w1,w2) is the number of titles in which w1 and w2 appear together.
A preliminary synonym database is built from the search logs, in which related vocabulary pairs and their similarity values are stored. Specifically, words appearing in the same query request, and words appearing in different query requests but yielding the same search result, serve as each other's preliminary synonym vectors; for a word w1 and its corresponding synonym w2, the similarity between the word and each of its corresponding preliminary synonym vectors is calculated as
Figure PCTCN2016088681-appb-000010
where count(w1) is the number of queries in which w1 appears, count(w2) is the number of queries in which w2 appears, count(w1,w2) is the number of queries in which w1 and w2 appear together, and same(w1,w2) is the number of times w1 and w2 appeared in different queries but retrieved the same result.
Optionally, the candidate synonym database establishing unit 301 acquires all vocabulary pairs that share a preliminary synonym relationship in the dictionary, video file and search log preliminary synonym databases, extracts the similarity values of each pair in each of those three databases, and then establishes the candidate synonym database.
In addition, the candidate synonym database establishing unit 301 averages each vocabulary pair's similarity values from the dictionary, the video file library and the search log records, and stores the average in the candidate synonym database. The candidate synonym database is thus represented as (w1,w2,T1,T2,T3,T), where T1 is the similarity value of the pair w1, w2 in the dictionary, T2 is its similarity value in the video file library, T3 is its similarity value in the search log records, and T is the average similarity of the pair w1, w2.
As another embodiment, the synonym model establishing unit 302 extracts the 1st to nth data records (w1,w2,T) from the candidate synonym database as input and the (n+1)th to 2nth records (w1,w2,T) as output, and trains a gradient boosting decision tree model, yielding the synonym gradient boosting decision tree model: F(T)=α1β1(T)+α2β2(T)+...+αmβm(T)
where β1 to βm are m decision trees, α1 to αm are the weights of the trees, and T is the average of the three similarity values corresponding to each vocabulary pair.
Optionally, the thesaurus establishing unit 303 substitutes the average similarity corresponding to each vocabulary pair in the candidate synonym database into the synonym gradient boosting decision tree model, obtaining the output of that model.
It should be noted that the specific implementation of the synonym data mining system of the present invention has already been described in detail in the synonym data mining method above, and the repeated content is therefore not described again here.
In summary, the synonym data mining method and system provided by the embodiments of the present invention creatively provide a method and system for establishing a synonym database; moreover, the synonyms in the synonym database are all highly accurate synonymous vocabulary pairs obtained through multiple layers of screening and calculation; furthermore, the synonym database can be applied in search applications, not only satisfying users' need to retrieve more content but also supporting user-defined search scope (whether or not synonym search results are included); the present invention therefore has broad and significant value for wider adoption; finally, the synonym data mining method and system as a whole are compact and easy to implement.
The embodiments of the present application further provide a computer storage medium that can store a program; when the program is executed, some or all of the steps of the various implementations of the foregoing synonym data mining method can be implemented.
Those of ordinary skill in the art should understand that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A synonym data mining method, characterized by comprising the steps of:
    acquiring vocabulary pairs, and the similarity values of said vocabulary pairs, from a dictionary, a video file library and search log records, and establishing a candidate synonym database associating said vocabulary pairs with the similarity values;
    training and obtaining a synonym model according to the data information in said candidate synonym database;
    substituting the similarity value corresponding to each vocabulary pair in said candidate synonym database into said synonym model to obtain an output value;
    storing in a synonym database those vocabulary pairs whose output value is greater than a preset threshold.
  2. The method according to claim 1, characterized in that acquiring the vocabulary pairs in the dictionary and the similarity values of said pairs comprises:
    encoding all vocabulary in the dictionary, taking the words appearing in each word's definition as its preliminary synonym vector, arranging them in a tree structure with said each word as a parent node and its preliminary synonym vector as child nodes, and calculating the similarity between said each word and each of its corresponding preliminary synonym vectors using a vector cosine similarity algorithm;
    acquiring the vocabulary pairs in the video file library and the similarity values of said pairs comprises: extracting video titles from a preset video file library, words appearing in the same title being added to each other's preliminary synonym vectors; and, for a word w1 and a synonym w2 corresponding to w1, calculating the similarity between the word and each of its corresponding preliminary synonym vectors as
    Figure PCTCN2016088681-appb-100001
    wherein count(w1) is the number of titles in which w1 appears, count(w2) is the number of titles in which w2 appears, and count(w1,w2) is the number of titles in which w1 and w2 appear together;
    acquiring the vocabulary pairs in the search log records and the similarity values of said pairs comprises: words appearing in the same query request, and words appearing in different query requests but having the same search result, serving as each other's preliminary synonym vectors; and, for a word w1 and a synonym w2 corresponding to w1, calculating the similarity between the word and each of its corresponding preliminary synonym vectors as
    Figure PCTCN2016088681-appb-100002
    wherein count(w1) is the number of queries in which w1 appears, count(w2) is the number of queries in which w2 appears, count(w1,w2) is the number of queries in which w1 and w2 appear together, and same(w1,w2) is the number of times w1 and w2 appeared in different queries but retrieved the same result.
  3. The method according to claim 2, characterized by further comprising, before establishing the candidate synonym database associating said vocabulary pairs with the similarity values:
    averaging the similarity values of each vocabulary pair from the dictionary, the video file library and the search log records, and storing the average in the candidate synonym database;
    the candidate synonym database being represented as (w1,w2,T1,T2,T3,T), wherein T1 is the similarity value of the pair w1, w2 in the dictionary, T2 is its similarity value in the video file library, T3 is its similarity value in the search log records, and T is the average similarity of the pair w1, w2.
  4. The method according to claim 3, characterized in that training and obtaining the synonym model comprises: extracting the 1st to nth data records (w1,w2,T) from the candidate synonym database as input, extracting the (n+1)th to 2nth data records (w1,w2,T) from the candidate synonym database as output, and training a gradient boosting decision tree model;
    obtaining the synonym gradient boosting decision tree model: F(T)=α1β1(T)+α2β2(T)+...+αmβm(T)
    wherein β1 to βm are m decision trees, α1 to αm are the weights of the trees, and T is the average of the three similarity values corresponding to each vocabulary pair.
  5. The method according to claim 4, characterized in that substituting the similarity value corresponding to said each vocabulary pair in said candidate synonym database into said synonym model comprises: substituting the average similarity corresponding to said each vocabulary pair in said candidate synonym database into said synonym gradient boosting decision tree model to obtain the output value of said synonym gradient boosting decision tree model.
  6. A synonym data mining system, characterized by comprising:
    a candidate synonym database establishing unit, configured to acquire vocabulary pairs, and the similarity values of said vocabulary pairs, from a dictionary, a video file library and search log records, and to establish a candidate synonym database associating said vocabulary pairs with the similarity values;
    a synonym model establishing unit, configured to train and obtain a synonym model according to the data information in said candidate synonym database;
    a thesaurus establishing unit, configured to substitute the similarity value corresponding to each vocabulary pair in said candidate synonym database into said synonym model to obtain an output value, and to store in a synonym database those vocabulary pairs whose output value is greater than a preset threshold.
  7. The system according to claim 6, characterized in that
    the candidate synonym database establishing unit is further configured to encode all vocabulary in the dictionary, take the words appearing in each word's definition as its preliminary synonym vector, arrange them in a tree structure with said each word as a parent node and its preliminary synonym vector as child nodes, and calculate the similarity between said each word and each of its corresponding preliminary synonym vectors using a vector cosine similarity algorithm;
    the synonym model establishing unit is further configured to extract video titles from a preset video file library, words appearing in the same title being added to each other's preliminary synonym vectors; and, for a word w1 and a synonym w2 corresponding to w1, to calculate the similarity between the word and each of its corresponding preliminary synonym vectors as
    Figure PCTCN2016088681-appb-100003
    wherein count(w1) is the number of titles in which w1 appears, count(w2) is the number of titles in which w2 appears, and count(w1,w2) is the number of titles in which w1 and w2 appear together;
    the thesaurus establishing unit is further configured such that words appearing in the same query request, and words appearing in different query requests but having the same search result, serve as each other's preliminary synonym vectors; and, for a word w1 and a synonym w2 corresponding to w1, to calculate the similarity between the word and each of its corresponding preliminary synonym vectors as
    Figure PCTCN2016088681-appb-100004
    wherein count(w1) is the number of queries in which w1 appears, count(w2) is the number of queries in which w2 appears, count(w1,w2) is the number of queries in which w1 and w2 appear together, and same(w1,w2) is the number of times w1 and w2 appeared in different queries but retrieved the same result.
  8. The system according to claim 7, characterized in that
    the candidate synonym database establishing unit is further configured to average each vocabulary pair's similarity values from the dictionary, the video file library and the search log records, and to store the average in the candidate synonym database;
    the candidate synonym database is represented as (w1,w2,T1,T2,T3,T), wherein T1 is the similarity value of the pair w1, w2 in the dictionary, T2 is its similarity value in the video file library, T3 is its similarity value in the search log records, and T is the average similarity of the pair w1, w2.
  9. The system according to claim 8, characterized in that
    the synonym model establishing unit is further configured to extract the 1st to nth data records (w1,w2,T) from the candidate synonym database as input, to extract the (n+1)th to 2nth data records (w1,w2,T) from the candidate synonym database as output, and to train a gradient boosting decision tree model;
    obtaining the synonym gradient boosting decision tree model: F(T)=α1β1(T)+α2β2(T)+...+αmβm(T)
    wherein β1 to βm are m decision trees, α1 to αm are the weights of the trees, and T is the average of the three similarity values corresponding to each vocabulary pair.
  10. The system according to claim 9, characterized in that
    the thesaurus establishing unit is further configured to substitute the similarity value corresponding to said each vocabulary pair in said candidate synonym database into said synonym model by substituting the average similarity corresponding to each vocabulary pair in the candidate synonym database into the synonym gradient boosting decision tree model, obtaining the output value of said synonym gradient boosting decision tree model.
PCT/CN2016/088681 2015-12-09 2016-07-05 Synonym data mining method and system WO2017096819A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/242,271 US20170169012A1 (en) 2015-12-09 2016-08-19 Method and System for Synonym Data Mining

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510908015.2A CN105868236A (zh) 2015-12-09 2015-12-09 一种同义词数据挖掘方法和系统
CN201510908015.2 2015-12-09

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/242,271 Continuation US20170169012A1 (en) 2015-12-09 2016-08-19 Method and System for Synonym Data Mining

Publications (1)

Publication Number Publication Date
WO2017096819A1 true WO2017096819A1 (zh) 2017-06-15

Family

ID=56624366

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/088681 WO2017096819A1 (zh) 2015-12-09 2016-07-05 一种同义词数据挖掘方法和系统

Country Status (3)

Country Link
US (1) US20170169012A1 (zh)
CN (1) CN105868236A (zh)
WO (1) WO2017096819A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287080A * 2020-10-23 2021-01-29 平安科技(深圳)有限公司 Method and apparatus for rewriting question sentences, computer device and storage medium

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9342502B2 (en) 2013-11-20 2016-05-17 International Business Machines Corporation Contextual validation of synonyms in otology driven natural language processing
CN107038173B * 2016-02-04 2021-06-25 腾讯科技(深圳)有限公司 Application query method and apparatus, and similar application detection method and apparatus
CN106844571B * 2017-01-03 2020-04-07 北京齐尔布莱特科技有限公司 Method, apparatus and computing device for identifying synonyms
CN107016055B * 2017-03-03 2020-12-18 阿里巴巴(中国)有限公司 Method, device and electronic device for mining entity aliases
CN107122423A * 2017-04-06 2017-09-01 深圳Tcl数字技术有限公司 Film and television recommendation method and apparatus
CN107203504B * 2017-05-18 2021-02-26 北京京东尚科信息技术有限公司 String replacement method and apparatus
CN108932222B * 2017-05-22 2021-11-19 中国移动通信有限公司研究院 Method and apparatus for acquiring word relevance
CN107451126B * 2017-08-21 2020-07-28 广州多益网络股份有限公司 Near-synonym screening method and system
CN107679030B * 2017-09-04 2021-08-13 北京京东尚科信息技术有限公司 Method and apparatus for extracting synonyms based on user operation behavior data
CN108255810B * 2018-01-10 2019-04-09 北京神州泰岳软件股份有限公司 Near-synonym mining method, apparatus and electronic device
US11182416B2 (en) 2018-10-24 2021-11-23 International Business Machines Corporation Augmentation of a text representation model
CN110032675A * 2019-03-13 2019-07-19 平安城市建设科技(深圳)有限公司 Retrieval method, apparatus, device and readable storage medium based on co-occurring words
CN110069599A * 2019-03-13 2019-07-30 平安城市建设科技(深圳)有限公司 Retrieval method, apparatus, device and readable storage medium based on approximate words
CN110222513B * 2019-05-21 2023-06-23 平安科技(深圳)有限公司 Anomaly monitoring method and apparatus for online activity, and storage medium
CN112084290B * 2019-06-13 2024-04-05 北京沃东天骏信息技术有限公司 Data retrieval method, apparatus, device and storage medium
CN113011166A * 2021-04-19 2021-06-22 华北电力大学 Synonym recognition method for relay protection defect text based on decision tree classification
CN114861638B * 2022-06-10 2024-05-24 安徽工程大学 Chinese synonym expansion method and apparatus
CN117093715B * 2023-10-18 2023-12-29 湖南财信数字科技有限公司 Thesaurus expansion method, system, computer device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070239742A1 (en) * 2006-04-06 2007-10-11 Oracle International Corporation Determining data elements in heterogeneous schema definitions for possible mapping
CN102591862A * 2011-01-05 2012-07-18 华东师范大学 Control method and apparatus for Chinese entity relation extraction based on word co-occurrence
CN102693279A * 2012-04-28 2012-09-26 合一网络技术(北京)有限公司 Method, apparatus and system for quickly calculating comment similarity
CN105095204A * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Synonym acquisition method and apparatus

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001043236A * 1999-07-30 2001-02-16 Matsushita Electric Ind Co Ltd Similar word extraction method, document retrieval method, and apparatus used therefor
EP1779263A1 (en) * 2004-08-13 2007-05-02 Swiss Reinsurance Company Speech and text analysis device and corresponding method
US9600566B2 (en) * 2010-05-14 2017-03-21 Microsoft Technology Licensing, Llc Identifying entity synonyms
US8688688B1 (en) * 2011-07-14 2014-04-01 Google Inc. Automatic derivation of synonym entity names
CN104978356B * 2014-04-10 2019-09-06 阿里巴巴集团控股有限公司 Synonym identification method and apparatus
JP2017514256A * 2014-04-24 2017-06-01 Semantic Technologies Pty Ltd Ontology aligner method, semantic matching method and apparatus
US10095784B2 (en) * 2015-05-29 2018-10-09 BloomReach, Inc. Synonym generation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070239742A1 (en) * 2006-04-06 2007-10-11 Oracle International Corporation Determining data elements in heterogeneous schema definitions for possible mapping
CN102591862A * 2011-01-05 2012-07-18 华东师范大学 Control method and apparatus for Chinese entity relation extraction based on word co-occurrence
CN102693279A * 2012-04-28 2012-09-26 合一网络技术(北京)有限公司 Method, apparatus and system for quickly calculating comment similarity
CN105095204A * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Synonym acquisition method and apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287080A * 2020-10-23 2021-01-29 平安科技(深圳)有限公司 Method and apparatus for rewriting question sentences, computer device and storage medium
CN112287080B * 2020-10-23 2023-10-03 平安科技(深圳)有限公司 Method and apparatus for rewriting question sentences, computer device and storage medium

Also Published As

Publication number Publication date
US20170169012A1 (en) 2017-06-15
CN105868236A (zh) 2016-08-17

Similar Documents

Publication Publication Date Title
WO2017096819A1 (zh) Synonym data mining method and system
WO2021139074A1 (zh) Knowledge-graph-based case retrieval method, apparatus, device and storage medium
US9679558B2 (en) Language modeling for conversational understanding domains using semantic web resources
CN104765769B (zh) Short text query expansion and retrieval method based on word vectors
JP5346279B2 (ja) Annotation by search
WO2018157805A1 (zh) Automatic question answering method and automatic question answering system
US9424294B2 (en) Method for facet searching and search suggestions
US10289717B2 (en) Semantic search apparatus and method using mobile terminal
US10437868B2 (en) Providing images for search queries
KR20190020119A (ko) Error correction method and device for search terms
US20150356091A1 (en) Method and system for identifying microblog user identity
CN105159938B (zh) Retrieval method and apparatus
US9251289B2 (en) Matching target strings to known strings
Hakkani-Tür et al. Probabilistic enrichment of knowledge graph entities for relation detection in conversational understanding
WO2014206241A1 (zh) Document similarity calculation method, and near-duplicate document detection method and apparatus
US20100131485A1 (en) Method and system for automatic construction of information organization structure for related information browsing
CN110347790B (zh) Attention-mechanism-based text duplicate checking method, apparatus, device and storage medium
CN109408578A (zh) Fusion method for heterogeneous environmental monitoring data
CN103714118A (zh) Book cross-reading method
CN111090771A (zh) Song search method, apparatus and computer storage medium
WO2013107031A1 (zh) Method, apparatus and system for determining video quality parameters based on comment information
CN116361510A (zh) Method and apparatus for automatic extraction and retrieval of script scene segment videos using film and television works and scripts
CN110727769A (zh) Corpus generation method and apparatus, and human-machine interaction processing method and apparatus
CN107133274B (zh) Distributed information retrieval collection selection method based on graph knowledge base
CN109903198B (zh) Patent comparative analysis method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16872022

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16872022

Country of ref document: EP

Kind code of ref document: A1