WO2018187949A1 - Perspective analysis method for machine learning model

Publication number
WO2018187949A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
result
learning model
user
machine learning
Prior art date
Application number
PCT/CN2017/080173
Other languages
French (fr)
Chinese (zh)
Inventor
邹霞
Original Assignee
邹霞
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 邹霞
Priority to PCT/CN2017/080173
Publication of WO2018187949A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00: Subject matter not provided for in other groups of this subclass

Abstract

A perspective analysis method for a machine learning model, comprising: collecting erroneous data fed back by users and extracting basic information, then extracting the relevant information in the feedback data to generate a feature space vector; calculating the score of each query result by using the original model and its sub-models to classify the user's query results, the classification result being the evaluation score; and, for each user query, calculating the nDCG value of the query results, where the actual ranking is obtained from the output of the machine learning model and the ideal ranking is obtained from the relevance of the query results to the user query. The method analyzes the sub-models inside a learning model, filters out the sub-models with poor classification results, selects the sub-models with good classification results, and recombines the selected sub-models to generate a new learning model with higher prediction accuracy.

Description

Specification
Title of invention: Perspective analysis method for a machine learning model
Technical field
[0001] The present invention relates to a perspective analysis method for a machine learning model and belongs to the field of Internet search.
Background art
[0002] With the rapid development of the Internet, search engines have become an important tool for using Internet information resources. With the rise of search engines such as Google, Yahoo!, Bing, and Baidu, the relevance of query results has attracted growing attention, and the quality of result ranking has become a main criterion for evaluating a search engine.
[0003] With the rapid development and wide application of information technology, the Internet has flourished and become the world's largest information resource, occupying an important place in people's lives and serving as an important platform for sharing and exchanging information. Finding the required information in such a vast and disorderly body of resources is like looking for a needle in a haystack, and search engines address exactly this problem. A search engine is a tool, built on the Internet, that provides network information retrieval services, and it has become the most important application of Internet technology. The user submits keywords as a query request; the search engine looks up its index database according to the query and returns the retrieved results, ranked by relevance analysis, to the user, helping people discard and ignore large amounts of irrelevant information and thereby acting as an information navigator. A massive amount of data, however, means a massive number of search results. In practice, most search-engine users browse only the first few pages of the returned results and rarely look at lower-ranked pages. Strongly relevant results should therefore be ranked near the top and weakly relevant results lower down. Ranking query results by relevance has thus become one of the core problems of search engines, and the relevance ranking of search results has become an important indicator of search engine performance.
[0004] In the search engine ranking problem, a multidimensional feature vector represents the relevant attributes and information of each data pair (user query, query result). A subset of the data pairs is sampled from the data set, and the relevance of the query result to the user query in each pair is labelled manually. The labelled data is used as the training set for a machine learning model, and the trained model is then used to predict the relevance between unseen queries and query results. However strong the theoretical foundation of a machine learning model may be, errors still turn up from time to time in practice. Prediction errors can have many causes, such as noisy or extreme training data, an unstable data distribution, or defects in the machine learning model itself.
[0005] Research on improving the prediction accuracy of machine learning models faces one difficulty in particular: once trained, a machine learning model becomes a "black box". In use, we supply some input and the model returns an output as its prediction for that input. We have no way of knowing how the model arrives at the result, because the data processing and the computation of the result inside the model are invisible to us. When a prediction is wrong, it is therefore often hard to judge how the internal structure of the model should be adjusted to improve its accuracy.
Technical problem
[0006] To improve the performance of a machine learning model, the usual practice is to keep collecting erroneous user feedback data as additional training data and to rebuild the learning model. However, the original learning model already performs well on most of the test data, and rebuilding the model because of a small amount of feedback data greatly reduces search efficiency. Moreover, once a learning model has been built, modifying it is difficult.
Solution to the problem
Technical solution
[0007] In view of the above shortcomings of the prior art, the object of the present invention is to provide a perspective analysis method for a machine learning model, comprising:
[0008] Step 1: collecting erroneous data fed back by users and extracting basic information; the relevant information in the feedback data is extracted to generate a feature space vector;
[0009] Step 2: calculating the score of the query results; the original model and its sub-models are used to classify the user's query results, and the classification result is the evaluation score;
[0010] Step 3: for each user query, calculating the nDCG value of the query results; the actual ranking is obtained from the output of the machine learning model, and the ideal ranking is obtained from the relevance of the query results to the user query. The nDCG value of the user query follows directly from the actual ranking and the ideal ranking;
[0011] Step 4: clustering; the optimal sub-model of each query is obtained from the trend of the nDCG values, and the user queries are clustered according to the similarity of their sub-models;
[0012] Step 5: extracting attributes; the information of all members of each class is analyzed, and certain attributes are extracted as the feature space vector of the class;
[0013] Step 6: learning an unknown user query; given an unknown user query, its attributes are analyzed and the query is classified, which yields the optimal sub-model to use when learning that query.
[0014] Preferably, the user feedback data collected in step 1 contains the attribute information of a series of query results.
[0015] Preferably, in step 2 the original learning model is applied to each query result, and the classification result on each decision tree is obtained in order to calculate the score of the query result.
[0016] Preferably, in step 3 the ranking produced after each decision tree predicts and classifies the query results is obtained, and the ideal ranking of the query results is obtained from their relevance to the user query. The nDCG value of each sub-model is calculated from the actual ranking and the ideal ranking.
[0017] Preferably, in step 4, for each user query the nDCG values of that query on all sub-models are obtained, which also yields the nDCG variation curve; from this curve, the sub-model built from decision trees for which nDCG is largest can be extracted.
[0018] Preferably, after the clustering result is obtained in step 5, all members of each class are analyzed and certain attributes are extracted as the feature space vector of the class.
Advantageous effects of the invention
Advantageous effects
[0019] Compared with the prior art, the perspective analysis method for a machine learning model provided by the present invention analyzes the sub-models inside the learning model, filters out the sub-models with poor classification results, selects the sub-models with good classification results, and recombines the selected sub-models to generate a new learning model with higher prediction accuracy.
Embodiments of the invention
[0020] The present invention provides a perspective analysis method for a machine learning model. To make the objects, technical solutions, and effects of the invention clearer, the invention is described in further detail below by way of an embodiment. It should be understood that the specific embodiment described here only explains the invention and does not limit it.
[0021] Once the machine learning model has produced a ranking of the query results, a uniform standard is needed to evaluate how good that ranking is. In research on search engine algorithms, DCG is commonly used to measure the quality of a ranking algorithm. Using DCG rests on two assumptions:
[0022] 1. In a search engine's result list, highly relevant query results are more useful to the user and should appear near the top of the list.
[0023] 2. For the user, the more relevant a query result is to the user query, the more useful that result is.
[0024] First, the DCG value of the actual ranking is obtained, calculated as shown in formula (1).
[Formula (1): equation image imgf000005_0001 not reproduced]
[0026] where p denotes the position of each query result in the query list, and rel_i denotes the relevance of the query result at the i-th position.
[0027] The DCG value of the ideal ranking is calculated as shown in formula (2).
[Formula (2): equation image imgf000005_0002 not reproduced]
[0029] This gives:
[Formula (3): equation image imgf000005_0003 not reproduced]
[0031] nDCG is a good measure of the quality of a ranking. The subscript n indicates that only the first n query results in the result list are taken into account.
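The equation images for formulas (1) to (3) did not survive in this text. For orientation only, one widely used textbook definition of DCG and nDCG is sketched below. The gain and discount shown here are an assumption, not the patent's confirmed formulas; other common variants use, for example, 2^{rel_i} - 1 as the gain. The symbol rel_i^{ideal} (relevance at position i of the ideal ranking) is introduced here for illustration.

```latex
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{IDCG}_p = \sum_{i=1}^{p} \frac{rel_i^{\mathrm{ideal}}}{\log_2(i+1)}, \qquad
\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}
```

Under this definition nDCG_p lies between 0 and 1 for non-negative relevance labels, and equals 1 exactly when the actual ranking is as good as the ideal ranking over the first p positions.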
[0032] This embodiment first collects user feedback data, which contains the relevance of the current query results to the user query. The established machine learning model is then applied to the feedback data, and each query result receives a score. Sorting the query results by their final scores gives the actual ranking, while sorting them by the user-labelled relevance gives the ideal ranking. Once the actual ranking, the relevance of the query results to the query, and the ideal ranking are available, the nDCG value can be calculated, namely: nDCG_p = DCG_p / IDCG_p, where IDCG_p denotes the DCG value of the ideal ranking.
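The paragraph above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the DCG form (relevance divided by log2 of position plus one) is an assumed textbook variant because the patent's own formula images are missing.

```python
import math


def dcg(relevances, n=None):
    """DCG of relevance labels given in ranked order.

    Assumed form: sum of rel_i / log2(i + 1) over positions i = 1..n.
    """
    top = relevances if n is None else relevances[:n]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(top, start=1))


def ndcg(scores, ratings, n=None):
    """nDCG of a score-based ranking against user-labelled ratings."""
    # Actual ranking: order the results by model score, highest first.
    ranked = sorted(zip(scores, ratings), key=lambda pair: pair[0], reverse=True)
    actual = [rating for _, rating in ranked]
    # Ideal ranking: order the results by labelled relevance, highest first.
    ideal = sorted(ratings, reverse=True)
    ideal_dcg = dcg(ideal, n)
    # If every label is 0 there is nothing to rank; return 0.0 by convention.
    return dcg(actual, n) / ideal_dcg if ideal_dcg > 0 else 0.0
```

When the scores order the results exactly as the ratings do, `ndcg` returns 1.0; any other ordering of distinct ratings returns a smaller value.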
[0034] Further, this embodiment comprises the following steps:
[0035] 1. Collecting erroneous data fed back by users and extracting basic information: the relevant information in the feedback data is extracted to generate a feature space vector;
[0036] 2. Calculating the score of the query results: the original model and its sub-models are used to classify the user's query results, and the classification result is the evaluation score;
[0037] 3. For each user query, calculating the nDCG value of the query results: the actual ranking is obtained from the output of the machine learning model, and the ideal ranking is obtained from the relevance of the query results to the user query. The nDCG value of the user query follows directly from the actual ranking and the ideal ranking;
[0038] 4. Clustering: the optimal sub-model of each query is obtained from the trend of the nDCG values, and the user queries are clustered according to the similarity of their sub-models;
[0039] 5. Extracting attributes: the information of all members of each class is analyzed, and certain attributes are extracted as the feature space vector of the class;
[0040] 6. Learning an unknown user query: given an unknown user query, its attributes are analyzed and the query is classified, which yields the optimal sub-model to use when learning that query.
[0041] The collected user feedback data contains the attribute information of a series of query results. It mainly includes: the user query, query; the set of query results, D, in which each query result doc corresponds to the URL of a retrieved web page; the relevance of each query result to the user query, rating; the label information, id; and the other attribute information used for classification in the learning model, features. The feature space vector of the user feedback data can be written as <query, doc, rating, features>.
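As an illustration of the record described above, the following sketch stores one feedback item per (query, doc) pair and groups the items by query. The class name, the field types, and the helper `group_by_query` are assumptions made for this example; the source only names the fields query, doc, rating, id, and features.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FeedbackRecord:
    """One <query, doc, rating, features> item of user feedback."""
    query: str                   # the user query text
    doc: str                     # URL of the retrieved web page
    rating: int                  # labelled relevance of doc to query
    features: Tuple[float, ...]  # attributes used by the learning model (assumed numeric)


def group_by_query(records: List[FeedbackRecord]) -> Dict[str, List[FeedbackRecord]]:
    """Collect the result set D of every query, since ranking is done per query."""
    groups: Dict[str, List[FeedbackRecord]] = defaultdict(list)
    for record in records:
        groups[record.query].append(record)
    return dict(groups)
```

Grouping by query first reflects the fact that rankings are only compared within the result set of a single query.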
[0042] In this embodiment, comparing query results across different user queries is meaningless; every ranking problem concerns the set of query results under a single user query. The rating falls into six classes, from weakest to strongest relevance: Unjudged, Bad, Fair, Good, Excellent, Perfect, represented numerically as 0, 1, 2, 3, 4, 5 respectively. Sorting the query results by their relevance yields the ideal ranking.
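The six rating classes and the ideal ranking can be expressed directly in code. The sketch below is illustrative: the enum values follow the 0 to 5 mapping stated above, while the names `Rating` and `ideal_order` are hypothetical.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Relevance classes, weakest to strongest, with their numeric values."""
    UNJUDGED = 0
    BAD = 1
    FAIR = 2
    GOOD = 3
    EXCELLENT = 4
    PERFECT = 5


def ideal_order(labelled_docs):
    """Return the docs of one query sorted by rating, most relevant first.

    labelled_docs is a list of (doc, Rating) pairs for a single user query.
    """
    ranked = sorted(labelled_docs, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked]
```

Because `IntEnum` members compare as integers, the labels can be used unchanged as relevance values in a DCG computation.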
[0043] Calculating the score of the query results: the original learning model is applied to each query result, and the classification result on each decision tree, i.e. the evaluation score, is obtained.
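A rough sketch of per-tree scoring follows. It rests on assumptions the source does not confirm: each decision tree is modelled as a plain callable that maps a feature tuple to a number, the ensemble is additive, and a sub-model is taken to be the first k trees. All names are hypothetical.

```python
from itertools import accumulate


def per_tree_scores(trees, features):
    """Output of every decision tree for one query result."""
    return [tree(features) for tree in trees]


def prefix_scores(trees, features):
    """Score of one query result under each prefix sub-model.

    Entry k - 1 is the sum of the outputs of the first k trees
    (assumption: additive ensemble, sub-model = first k trees).
    """
    return list(accumulate(per_tree_scores(trees, features)))
```

With three trees, `prefix_scores` returns three numbers: the score under the first tree alone, under the first two trees, and under the full model.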
[0044] Calculating the nDCG value: for each query, the ranking produced after each decision tree predicts and classifies the query results can be obtained. The ideal ranking of the query results is obtained from their relevance to the user query. From the actual ranking and the ideal ranking, the nDCG value of each sub-model can be calculated as follows:
[0045] The DCG value of the actual ranking:
[Formula: illegible in the source text]
[0046] where p denotes the position of each query result in the query list, and rel_i denotes the relevance of the query result at the i-th position.
[0047] The DCG value of the ideal ranking:
[Formula: equation image imgf000007_0001 not reproduced]
[0048] This gives: nDCG_p = DCG_p / IDCG_p, where IDCG_p denotes the DCG value of the ideal ranking.
[0049] The feature vector space model can then be expressed as <query, nDCG_1, nDCG_2, ..., nDCG_n>.
[0050] Clustering: for each user query, the nDCG values of that query on all sub-models can be obtained, which also yields the nDCG variation curve. From this curve, the sub-model m = [t_r1, t_r2, ..., t_ra], built from decision trees, for which nDCG is largest can be extracted; the ranking produced by this sub-model attains the maximum nDCG value. The larger the nDCG value, the closer the actual ranking is to the ideal ranking, which indicates that the sub-model m = [t_r1, t_r2, ..., t_ra] achieves the highest prediction accuracy and has the best predictive performance.
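Selecting the optimal sub-model from the nDCG curve amounts to an argmax. The sketch below assumes the curve is given as a list with one nDCG value per sub-model; the function name and the tie-breaking rule (earliest sub-model wins) are choices made for this example, not taken from the source.

```python
def best_submodel(ndcg_curve):
    """Return (index, value) of the sub-model with the largest nDCG.

    ndcg_curve[k] is the nDCG of one query under sub-model k.
    Ties are resolved in favour of the earliest sub-model.
    """
    if not ndcg_curve:
        raise ValueError("ndcg_curve must contain at least one value")
    # max() returns the first index that attains the maximum value.
    best_index = max(range(len(ndcg_curve)), key=lambda k: ndcg_curve[k])
    return best_index, ndcg_curve[best_index]
```

For example, a curve that rises and then falls has its optimal sub-model at the peak; queries whose peaks fall on similar sub-models are the ones the method would place in the same cluster.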
[0051] Extracting the attributes of a class: after the clustering result C = {c_1, c_2, ..., c_k} is obtained, all members of each class are analyzed and certain attributes are extracted as the feature space vector of the class. Mathematically, the feature space vector of class c_i is F_i = {f_i1, f_i2, ..., f_im}.
[0052] Learning an unknown user query: to make a prediction for an unknown user query q_n, the attributes of q_n are first extracted, including the length of the query, the number of times users have searched for it, the words appearing in it, and the number of occurrences of each word. The feature vector of the query is F_n = {f_n1, f_n2, ..., f_nm}. Once the feature vector is available, the distance from the query to each class is calculated, and the class with the smallest distance is the classification result for the query. Denoting the classification result by c, c is calculated as:
c = argmin_i d(F_n, F_i), where d(F_n, F_i) is the distance between the feature vector of the query and the feature vector of class c_i.
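The nearest-class rule can be sketched as follows. The source states only that the class with the smallest distance is chosen; the Euclidean distance used here is an assumption, and the function name and the dictionary layout of the class vectors are hypothetical.

```python
import math


def nearest_class(query_vector, class_vectors):
    """Return the id of the class whose feature vector is closest to the query.

    class_vectors maps a class id to that class's feature space vector.
    Distance is Euclidean (assumption; the source formula is illegible).
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(class_vectors, key=lambda c: distance(query_vector, class_vectors[c]))
```

A call such as `nearest_class([1.0, 2.0], {"c1": [0.0, 0.0], "c2": [1.0, 2.5]})` selects `"c2"`, whose vector lies 0.5 away from the query, rather than `"c1"`, which lies about 2.24 away.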
[0053] The optimal sub-model of class c is then used to classify the results of the current user query, which gives the evaluation score of each query result. The query results are sorted according to these evaluation scores.
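The final scoring and sorting step might look like the sketch below. It assumes, as an illustration only, that a sub-model is a list of decision trees modelled as callables and that the evaluation score is the sum of their outputs; `rank_results` is a hypothetical name.

```python
def rank_results(results, submodel):
    """Score each query result with the chosen sub-model and sort by score.

    results  : list of (doc, features) pairs for one user query
    submodel : list of callables, each mapping features to a number
               (assumption: the evaluation score is the sum of tree outputs)
    Returns a list of (doc, score) pairs, highest score first.
    """
    scored = [(doc, sum(tree(features) for tree in submodel))
              for doc, features in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Only the trees of the optimal sub-model for the query's class are evaluated, so the remaining trees of the original model do not influence the ranking.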
[0055] Compared with the prior art, the perspective analysis method for a machine learning model provided by the present invention analyzes the sub-models inside the learning model, filters out the sub-models with poor classification results, selects the sub-models with good classification results, and recombines the selected sub-models to generate a new learning model with higher prediction accuracy.
[0057] It will be understood that those of ordinary skill in the art may make equivalent substitutions or changes based on the technical solution of the present invention and its inventive concept, and all such changes or substitutions shall fall within the scope of protection of the appended claims.

Claims

权利要求书 Claim
[Claim 1] A perspective analysis method for a machine learning model, characterized in that the method comprises the following steps:
Step 1: collect the error data fed back by users and extract basic information, extracting the relevant information in the feedback data to generate a feature space vector;
Step 2: compute the scores of the query results, using the original model and the sub-models to learn and classify the user's query results, the classification result being the evaluation score;
Step 3: for each user query, compute the nDCG value of the query results: the actual ranking is obtained from the training result of the machine learning model, the ideal ranking is obtained from the query results and the user query, and the nDCG value of the user query follows directly from the actual ranking and the ideal ranking;
Step 4: clustering: according to the trend of the nDCG values, obtain the optimal sub-model for each query, and cluster the user queries according to the similarity of their sub-models;
Step 5: attribute extraction: analyze the information of all members of each class, and extract certain attributes as the feature space vector of that class;
Step 6: learning an unknown user query: given an unknown user query, analyze its attributes and classify the query, thereby obtaining the optimal sub-model to be used when learning on that query.

[Claim 2] The perspective analysis method for a machine learning model according to claim 1, characterized in that the user feedback data collected in Step 1 contains attribute information for a series of query results.

[Claim 3] The perspective analysis method for a machine learning model according to claim 1, characterized in that Step 2 uses the original learning model to learn on each query result and obtains the classification result on each decision tree, from which the score of the query result is computed.

[Claim 4] The perspective analysis method for a machine learning model according to claim 1, characterized in that in Step 3 the ranking of the query results after prediction and classification by each decision tree is obtained; the ideal ranking of the query results is obtained from the relevance of the query results to the user query; and the nDCG value of each sub-model is computed from the actual ranking and the ideal ranking.

[Claim 5] The perspective analysis method for a machine learning model according to claim 1, characterized in that in Step 4, for each user query, the nDCG values of the query on all sub-models are obtained, together with the nDCG curve; based on the nDCG curve, the decision trees at which nDCG reaches its maximum value are extracted to construct the sub-model.

[Claim 6] The perspective analysis method for a machine learning model according to claim 1, characterized in that in Step 5, after the clustering result is obtained, all members of each class are analyzed and certain attributes are extracted as the feature space vector of the class.
PCT/CN2017/080173 2017-04-12 2017-04-12 Perspective analysis method for machine learning model WO2018187949A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/080173 WO2018187949A1 (en) 2017-04-12 2017-04-12 Perspective analysis method for machine learning model

Publications (1)

Publication Number Publication Date
WO2018187949A1 true WO2018187949A1 (en) 2018-10-18

Family

ID=63792211

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/080173 WO2018187949A1 (en) 2017-04-12 2017-04-12 Perspective analysis method for machine learning model

Country Status (1)

Country Link
WO (1) WO2018187949A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101556603A (en) * 2009-05-06 2009-10-14 北京航空航天大学 Coordinate search method used for reordering search results
US20100076949A1 (en) * 2008-09-09 2010-03-25 Microsoft Corporation Information Retrieval System
CN103530321A (en) * 2013-09-18 2014-01-22 上海交通大学 Sequencing system based on machine learning
CN103544307A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Multi-search-engine automatic comparison and evaluation method independent of document library
CN103646070A (en) * 2013-12-06 2014-03-19 北京趣拿软件科技有限公司 Data processing method and device for search engine
CN106339383A (en) * 2015-07-07 2017-01-18 阿里巴巴集团控股有限公司 Method and system for sorting search

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17905604

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 200220)

122 Ep: pct application non-entry in european phase

Ref document number: 17905604

Country of ref document: EP

Kind code of ref document: A1