WO2018187949A1

WO2018187949A1 - Perspective analysis method for machine learning model

Info

Publication number: WO2018187949A1
Application number: PCT/CN2017/080173
Authority: WO
Inventors: 邹霞
Original assignee: 邹霞
Priority date: 2017-04-12
Filing date: 2017-04-12
Publication date: 2018-10-18

Abstract

A perspective analysis method for a machine learning model, comprising: collecting error data fed back by a user, extracting basic information, and extracting related information from the feedback data to generate a feature space vector; calculating a score of each query result by using an original model and its sub-models to classify the user query results, the classification result being an evaluation score; and, for each user query, calculating an nDCG value of the query results, where the actual ordering is obtained from the training result of the machine learning model and the ideal ordering is obtained from the query results and the user query. The method analyzes the sub-models in a learning model, filters out the sub-models with poor classification results, selects the sub-models with good classification results, and regroups the selected sub-models to generate a new learning model with higher prediction accuracy.

Description

Specification

Name of Invention: Perspective Analysis Method of Machine Learning Model

Technical Field

[0001] The present invention relates to a perspective analysis method of a machine learning model, and belongs to the field of Internet search.

Background technique

[0002] With the rapid development of the Internet, search engines have become an important tool for people to use Internet information resources. With the rise and development of search engines such as Google, Yahoo!, Bing, and Baidu, the relevance of query results has attracted more and more attention. The quality of the ranking of query results has also become a main indicator for evaluating a search engine.

[0003] With the rapid development and wide application of information technology, the Internet has prospered and become the world's largest information resource, occupying an important position in people's lives. The Internet has also become an important platform for people to share and exchange information. Finding the information one needs in such a large and disorderly body of resources is like looking for a needle in a haystack, and the search engine solves exactly this problem. A search engine is a tool, built on the Internet platform, that provides network information retrieval services, and it has become one of the most important applications of Internet technology. The user gives keywords as a query request; the search engine queries the index database according to the user query and returns the retrieval results, after sorting and relevance analysis, to the user. This helps the user reject and ignore a large amount of irrelevant information, so the search engine plays the role of information navigation. A massive amount of information also means massive search results. In practice, most search engine users browse only the first few pages of the returned results and rarely look at lower-ranked pages. Search results with strong relevance should therefore be ranked higher, and results with weak relevance lower. Sorting the query results according to their relevance thus becomes one of the core problems of search engines, and the relevance ranking of search results has become an important indicator for evaluating search engine performance.

[0004] In the search engine ranking problem, a multidimensional feature vector is used to represent the relevant attributes and information of each data pair (user query, query result). Some data pairs are extracted from the data set, and the relevance of the query result to the user query in each pair is labeled manually. The machine learning model is trained using the labeled data as a training data set, and the resulting model is used to predict the relevance between unknown queries and query results. However, no matter how strong the theoretical foundation of a machine learning model is, errors are always found during its application. There are many reasons why a machine learning model makes wrong predictions in application, such as noisy or extreme training data, an unstable data distribution, and defects in the machine learning model itself.

[0005] However, in research aimed at improving the prediction accuracy of a machine learning model, one of the problems we face is that the model becomes a "black box" once training is complete. The application process consists only of providing some input, to which the machine learning model returns an output as its prediction. We are completely unable to observe how the model arrives at that output: the data processing and the result calculation inside the machine learning model are invisible to us. Therefore, in the face of erroneous predictions, it is often difficult to judge how the internal structure of the machine learning model should be adjusted to improve its predictive accuracy.

technical problem

[0006] In order to improve the performance of machine learning models, it is common practice to continuously collect erroneous user feedback data as additional training data and to re-establish a new learning model. However, the original learning model already achieves good results on most of the test data sets. Re-establishing a new learning model because of a small amount of feedback data greatly reduces the efficiency of the search. Moreover, once a learning model is established, modifying it becomes difficult.

Problem solution

Technical solution

In view of the above deficiencies of the prior art, it is an object of the present invention to provide a perspective analysis method for a machine learning model, comprising:

[0008] Step 1: collecting error data fed back by the user and extracting basic information, and extracting relevant information in the feedback data to generate a feature space vector;

[0009] Step 2: Calculating the score of the query result, using the original model and the sub-model to classify the user query result, and obtain the classification result, that is, the evaluation score;

[0010] Step 3: For each user query, calculating the nDCG value of the query result. The actual sorting is obtained from the training result of the machine learning model, and the ideal sorting is obtained from the query result and the user query. The nDCG value of the user query is then obtained from the actual sorting and the ideal sorting;

[0011] Step 4: Clustering. According to the trend of the nDCG value, obtaining the optimal sub-model of each query, and clustering the user queries according to the similarity of their sub-models;

[0012] Step 5: Extracting attributes. Analyzing all member information in each class, and extracting some attributes as the feature space vector of the class;

[0013] Step 6: Learning an unknown user query. When an unknown user query is given, analyzing its attributes and classifying the user query, thereby obtaining the optimal sub-model corresponding to the user query.

[0014] Preferably, the user feedback data collected in step 1 above includes attribute information of a series of query results.

[0015] Preferably, in the above step 2, the original learning model is applied to each query result, and the classification result on each decision tree is obtained in order to calculate the score of the query result.

[0016] Preferably, in the above step 3, the ranking of the query results produced by the prediction of each decision tree is obtained, and the ideal order of the query results is obtained according to the relevance between the query results and the user query. The nDCG value of each sub-model is calculated from the actual order and the ideal order.

[0017] Preferably, in the above step 4, for each user query, the nDCG value of the query on all the sub-models can be obtained, and with it the variation curve of nDCG. According to the variation curve of nDCG, the sub-model constructed from decision trees that gives nDCG its largest value is extracted.

[0018] Preferably, after the clustering result is obtained in the above step 5, all members in each class are analyzed, and some attributes are extracted as the feature space vector of the category.

Advantageous effects of the invention

Beneficial effect

[0019] Compared with the prior art, the perspective analysis method of the machine learning model provided by the present invention analyzes the sub-models inside the learning model, filters out the sub-models with poor classification results, selects the sub-models with better classification results, and reorganizes the selected sub-models to generate a new learning model. The resulting new learning model has higher prediction accuracy.

Embodiments of the invention

The present invention provides a perspective analysis method for a machine learning model. The present invention is described in further detail below in order to clarify its objects, technical solutions, and effects. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0021] After the machine learning model has produced a ranking of the query results, a uniform standard is needed to evaluate the ranking results. In search engine algorithm research, nDCG is often used to measure the quality of a sorting algorithm. The use of nDCG rests on two assumptions:

[0022] 1. In the search engine query result list, a query result with high relevance is more useful to the user and should be ranked near the front of the query list.

[0023] 2. For the user, the more relevant a query result is to the user query, the more application value that query result has.

[0024] First, the value of the actual sort, denoted $\mathrm{DCG}_p$, is obtained; the calculation method is shown in formula (1).

(1)

[0026] where $p$ indicates the position of each query result in the query list, and $rel_i$ indicates the relevance of the query result at the $i$-th position.

[0027] The value of the ideal sort result, denoted $\mathrm{IDCG}_p$, is calculated as shown in formula (2).

[0029] The normalized value can then be obtained:

$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$

[0031] nDCG provides a good evaluation of the quality of the ranking results. $\mathrm{nDCG}_p$ reflects only the quality of the first $p$ query results in the list of query results.
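Formulas (1) and (2) are not legible in this text. As a reference point only, one widely used definition that fits the symbols above ($p$ as the cutoff position, $rel_i$ as the relevance at position $i$) is sketched below. The exponential gain $2^{rel_i} - 1$, the base-2 logarithmic discount, and the symbol $rel^{*}_i$ (the relevance at position $i$ of the ideal ordering) are assumptions and are not confirmed by the source; nDCG is the ratio of the two quantities.

```latex
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{IDCG}_p = \sum_{i=1}^{p} \frac{2^{rel^{*}_i} - 1}{\log_2(i + 1)}
```

Under this assumed form, a ranking that places highly relevant results first scores higher, because each gain is divided by a discount that grows with the position.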

[0032] This embodiment first collects user feedback data. The user feedback data includes the relevance of the current query result to the user query. After the user feedback data is obtained, the established machine learning model is applied to the feedback data, which yields a training score for each query result. Sorting the query results by this final score gives the actual sort result, and sorting the query results by the relevance marked by the user gives the ideal sort result. Once the actual sort result, the relevance between the query results and the query, and the ideal sort result are available, the value of nDCG can be calculated as the ratio of the DCG of the actual sort result to the DCG of the ideal sort result.
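The computation described in this paragraph can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `dcg` and `ndcg` are hypothetical, and the exponential gain with a base-2 logarithmic discount is an assumed DCG variant, since the source formula is not legible.

```python
import math

def dcg(relevances, p=None):
    """DCG over the first p relevance values (assumed variant: 2^rel - 1 gain, log2 discount)."""
    rels = relevances if p is None else relevances[:p]
    return sum((2 ** rel - 1) / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(scores, ratings, p=None):
    """nDCG of the ordering induced by model scores, given user-marked relevance ratings."""
    # Actual sort: order the results by descending model score.
    ranked = sorted(zip(scores, ratings), key=lambda sr: sr[0], reverse=True)
    actual = [rating for _, rating in ranked]
    # Ideal sort: order the results by descending user-marked relevance.
    ideal = sorted(ratings, reverse=True)
    ideal_dcg = dcg(ideal, p)
    # A query with no relevant results has no meaningful ratio; return 0.0 by convention.
    return dcg(actual, p) / ideal_dcg if ideal_dcg > 0 else 0.0
```

When the model scores order the results exactly as the user ratings do, the actual and ideal sorts coincide and the value is 1; any other ordering of distinct ratings gives a smaller value.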

[0034] Further, the embodiment includes the following steps:

[0035] 1. Collecting error data fed back by the user and extracting basic information, and extracting relevant information in the feedback data to generate a feature space vector;

[0036] 2. Calculating the score of the query result: using the original model and the sub-model to classify the user query result, and obtain the classification result, that is, the evaluation score;

[0037] 3. For each user query, calculating the nDCG value of the query result: the actual sorting is obtained from the training result of the machine learning model, and the ideal sorting is obtained from the query result and the user query. The nDCG value of the user query is then obtained from the actual sorting and the ideal sorting;

[0038] 4. Clustering: according to the trend of the nDCG value, obtaining the optimal sub-model of each query, and clustering the user queries according to the similarity of their sub-models;

[0039] 5. Extracting attributes: analyzing all member information in each class, and extracting some attributes as the feature space vector of the class;

[0040] 6. Learning an unknown user query: when an unknown user query is given, its attributes are analyzed and the user query is classified, so that the optimal sub-model corresponding to the user query is obtained.

[0041] The collected user feedback data includes attribute information of a series of query results. It mainly includes: the user query (query); the query result set D, in which each query result doc corresponds to the URL of a searched web page; the degree of relevance of each query result to the user query (rating); the tag information id; and other attribute information used for classification in the learning model (features). The feature space vector of the user feedback data can be expressed as <query, doc, rating, features>.

[0042] In this embodiment, it is meaningless to compare query results under different user queries; in all sorting problems, the query results are discussed under the same user query. Rating is divided into six grades, from weak to strong relevance: Unjudged, Bad, Fair, Good, Excellent, Perfect. The corresponding relevance values are 0, 1, 2, 3, 4, and 5. Sorting the query results by their relevance yields the ideal sort.
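The feedback record and the ideal sort can be illustrated with a short Python sketch. The names `FeedbackRecord`, `RATING_VALUES`, and `ideal_sort_by_query` are hypothetical and chosen for illustration; only the six rating labels and their values 0 to 5 come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Rating labels and relevance values as listed in the text.
RATING_VALUES = {"Unjudged": 0, "Bad": 1, "Fair": 2, "Good": 3, "Excellent": 4, "Perfect": 5}

@dataclass
class FeedbackRecord:
    """One <query, doc, rating, features> entry of the user feedback data."""
    query: str
    doc: str      # URL of the result page
    rating: str   # one of the keys of RATING_VALUES
    features: list = field(default_factory=list)

def ideal_sort_by_query(records):
    """Group records by query, then sort each group by descending relevance value."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.query].append(rec)
    return {
        query: sorted(recs, key=lambda r: RATING_VALUES[r.rating], reverse=True)
        for query, recs in grouped.items()
    }
```

Grouping before sorting reflects the statement that query results are only compared under the same user query.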

[0043] Calculating the score of the query result: the original learning model is applied to each query result, and the classification result on each decision tree, that is, the evaluation score, is obtained.
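One way to picture per-tree evaluation scores is sketched below in Python. It rests on assumptions the source does not state: each decision tree is modeled as a plain callable from a feature list to a number, the model is additive, and the k-th sub-model is taken to be the first k trees. The function name `submodel_scores` is hypothetical.

```python
from itertools import accumulate

def submodel_scores(trees, features):
    """Return cumulative scores for one query result.

    Entry k-1 is the score given by the sub-model made of the first k trees
    (an assumed definition of a sub-model for an additive tree ensemble).
    """
    return list(accumulate(tree(features) for tree in trees))
```

With this layout, a single pass over the trees yields the score of every prefix sub-model, so each sub-model can later be evaluated without re-running the trees.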

[0044] Calculating the nDCG value: for each query, the ranking produced by the classification results predicted by each decision tree can be obtained. According to the relevance between the query results and the user query, the ideal order of the query results can be obtained. From the actual sorting and the ideal sorting, the nDCG value of each sub-model can be calculated. The calculation method is as follows:

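[0045] The value of the actual sort is $\mathrm{DCG}_p$, calculated as in formula (1),

[0046] where $p$ represents the position of each query result in the query list, and $rel_i$ represents the relevance of the search result at the $i$-th position.

[0047] The value of the ideal ranking result is $\mathrm{IDCG}_p$, calculated as in formula (2),

[0048] and then

$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$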

[0049] The feature vector space model can then be expressed as: <query, nDCG_1, nDCG_2, ..., nDCG_n>.

[0050] Clustering: for each user query, the nDCG value of the query on every sub-model can be obtained, and from these values the nDCG variation curve can be drawn. From the curve, the sub-model $m = [t_{r1}, t_{r2}, \ldots, t_{ra}]$, constructed from decision trees, for which nDCG takes its maximum value can be extracted; the training result of this sub-model yields the largest nDCG value. The larger the nDCG value, the closer the actual ranking result is to the ideal ranking result. This also indicates that the sub-model $m$ achieves the highest prediction accuracy and has the best prediction performance for that query.
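Selecting the optimal sub-model from an nDCG curve, and grouping queries accordingly, might look like the following Python sketch. The function names are hypothetical, and grouping queries whose optimal sub-models are identical is a deliberately simple stand-in for clustering by sub-model similarity, whose exact measure the text does not specify.

```python
from collections import defaultdict

def best_submodel(ndcg_curve):
    """Index of the sub-model with the largest nDCG value (the first one on ties)."""
    return max(range(len(ndcg_curve)), key=lambda k: ndcg_curve[k])

def cluster_by_best_submodel(curves):
    """Group queries that share the same optimal sub-model.

    curves maps each query to its list <nDCG_1, ..., nDCG_n> over the n sub-models.
    """
    clusters = defaultdict(list)
    for query, curve in curves.items():
        clusters[best_submodel(curve)].append(query)
    return dict(clusters)
```

A similarity-based clustering could replace the exact-match grouping without changing how the optimal sub-model of each query is found.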

[0051] Extracting the attributes of the category: after the clustering result $C = \{C_1, C_2, \ldots\}$ is obtained, all members of each class are analyzed, and some attributes are extracted as the feature space vector of the category. Mathematically, the feature space vector of class $C_i$ can be written as $F_i = \{f_{i1}, f_{i2}, \ldots\}$.
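The text does not say how the attributes of a class are derived from its members. As one possible reading, the Python sketch below takes the component-wise mean of the members' numeric attribute vectors; both the averaging rule and the name `class_feature_vector` are assumptions made for illustration.

```python
def class_feature_vector(member_vectors):
    """Component-wise mean of the attribute vectors of all member queries of one class.

    member_vectors is a non-empty list of equal-length numeric lists.
    """
    count = len(member_vectors)
    return [sum(column) / count for column in zip(*member_vectors)]
```

Any other summary of the members, such as a median or a selected subset of attributes, would fit the same interface.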

[0052] Learning an unknown user query: to predict an unknown user query, first extract the attributes of the query $q_u$, including the length of the user query, the number of times the user has searched the query, the words that appear in the query, and the number of times each word appears. The feature vector of the user query is $q_u = \{l_1, l_2, \ldots\}$. After the feature vector is obtained, the distance from the user query to each category is calculated, and the category with the smallest distance is the classification result of the user query. Let the classification result be denoted by $c$; then $c$ is calculated as follows:

$$c = \arg\min_{i} \, d(q_u, F_i)$$

where $d(q_u, F_i)$ is the distance between the feature vector of the user query and the feature space vector $F_i$ of class $C_i$.
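A minimal Python sketch of the attribute extraction and the nearest-category rule follows. The names `query_attributes` and `nearest_class` are hypothetical, measuring query length in words is an assumed unit, and Euclidean distance is an assumption because the distance formula in the source is not legible. Turning the word counts into numeric vector components is left out of the sketch.

```python
import math
from collections import Counter

def query_attributes(query, search_count):
    """Collect the attributes named in the text for one user query."""
    words = query.split()
    return {
        "length": len(words),           # query length in words (assumed unit)
        "search_count": search_count,   # times the user searched this query
        "word_counts": Counter(words),  # each word and its number of occurrences
    }

def nearest_class(query_vector, class_vectors):
    """Return the id of the class whose feature vector is closest to the query.

    class_vectors maps a class id to its numeric feature space vector.
    Euclidean distance is an assumption, not taken from the source.
    """
    return min(class_vectors, key=lambda c: math.dist(query_vector, class_vectors[c]))
```

The returned class id is then used to look up the optimal sub-model associated with that class.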

[0053] The optimal sub-model of class $c$ is then used to classify the currently obtained user query results, which gives the evaluation score of each user query result. The user query results are sorted according to these evaluation scores.
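The final ranking step can be sketched in Python as below. The sketch assumes that a sub-model is a list of decision trees, each modeled as a callable from a feature list to a number, and that the evaluation score is the sum of the tree outputs; the name `rank_results` is hypothetical.

```python
def rank_results(results, submodel):
    """Score each query result with the sub-model and sort by descending score.

    results is a list of (doc, features) pairs; submodel is a list of tree callables.
    """
    scored = [(doc, sum(tree(features) for tree in submodel)) for doc, features in results]
    return sorted(scored, key=lambda doc_score: doc_score[1], reverse=True)
```

Because only the trees of the selected sub-model are evaluated, the ranking uses the part of the original model that performed best for queries of the same class.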

[0054]

Compared with the prior art, the perspective analysis method of the machine learning model provided by the present invention analyzes the sub-models inside the learning model, filters out the sub-models with poor classification results, selects the sub-models with better classification results, and reorganizes the selected sub-models to generate a new learning model. The resulting new learning model has higher prediction accuracy.

[0056]

[0057] It is to be understood that those skilled in the art can make equivalent substitutions or changes in accordance with the technical solutions of the present invention and the inventive concept thereof, and all such changes or substitutions shall fall within the scope of protection of the appended claims.

Claims

Claim

[Claim 1] A perspective analysis method of a machine learning model, characterized in that the method comprises the following steps:

Step 1: collecting error data fed back by the user, extracting basic information, and extracting relevant information from the feedback data to generate a feature space vector;

Step 2: calculating the score of the query result, using the original model and the sub-models to classify the user query result, and obtaining the classification result, that is, the evaluation score;

Step 3: for each user query, calculating the nDCG value of the query result, wherein the actual sorting is obtained from the training result of the machine learning model, the ideal sorting is obtained from the query result and the user query, and the nDCG value of the user query is then obtained from the actual sorting and the ideal sorting;

Step 4: clustering, wherein the optimal sub-model of each query is obtained according to the trend of the nDCG value, and the user queries are clustered according to the similarity of their sub-models;

Step 5: extracting attributes, wherein all member information in each class is analyzed and some attributes are extracted as the feature space vector of the class;

Step 6: learning an unknown user query, wherein, when an unknown user query is given, its attributes are analyzed and the user query is classified, thereby obtaining the optimal sub-model corresponding to the user query.

[Claim 2] The perspective analysis method of the machine learning model according to claim 1, wherein: the user feedback data collected in step 1 includes attribute information of a series of query results.

[Claim 3] The perspective analysis method of the machine learning model according to claim 1, wherein: in step 2, the original learning model is applied to each query result, and the classification result on each decision tree is obtained in order to calculate the score of the query result.

[Claim 4] The perspective analysis method of the machine learning model according to claim 1, wherein: in step 3, the ranking of the query results is obtained from the classification result predicted by each decision tree; the ideal order of the query results is obtained according to the relevance between the query results and the user query; and the nDCG value of each sub-model is calculated from the actual order and the ideal order.

[Claim 5] The perspective analysis method of the machine learning model according to claim 1, wherein: in step 4, for each user query, the nDCG value of the query on all sub-models is obtained, and with it the nDCG curve; according to the nDCG curve, the sub-model constructed from decision trees that gives nDCG its maximum value is extracted.

[Claim 6] The perspective analysis method of the machine learning model according to claim 1, wherein: after the clustering result is obtained in step 5, all members in each class are analyzed, and some attributes are extracted as the feature space vector of the category.