TWI771284B

TWI771284B - Data-driven method and device for predicting user problems

Info

Publication number: TWI771284B
Application number: TW106102498A
Authority: TW
Inventors: 薛少飛; 張家興; 崔恒斌
Original assignee: 香港商阿里巴巴集團服務有限公司
Priority date: 2017-01-23
Filing date: 2017-01-23
Publication date: 2022-07-21
Also published as: TW201828165A

Abstract

The present invention discloses a data-driven method and device for predicting user problems. When a question raised by a user is received, the method collects the user's behavior data and preprocesses it, intercepts from the preprocessed data the candidate behavior data that contribute to the user's question, filters the candidate behavior data against a preset target behavior data set so that only the candidates contained in that set are retained, and inputs the retained candidates into a trained classifier model to predict the category to which the user's question belongs. The device comprises a preprocessing module, an interception module, and a prediction module. The method and device can significantly improve prediction performance.

Description

Data-driven method and device for predicting user problems

The present invention relates to the technical field of data processing, and in particular to a data-driven method and device for predicting user problems.

Users often run into problems they cannot resolve on their own when using a product or service, and then turn to customer service for help. Customer service staff typically need several rounds of dialogue with the user to determine what the problem is, which incurs substantial labor cost. If the problems a user is likely to encounter can be predicted in advance, relevant answers can be pushed automatically, or customer service staff can be helped to locate the user's problem more efficiently.

Predicting in advance the problems a user may encounter is a typical multi-class classification problem, usually consisting of two parts: feature selection and modeling. In existing methods, features are extracted by matching against manually defined rules that are empirically believed to be related to the questions a user may ask, such as whether the user has activated a certain service or whether the user has made a purchase in the past few days. Matching against these rules yields features that describe the user's state before asking the question. These features are then modeled with logistic regression to obtain a classifier, which is used to make predictions on new features.

In the prior art, rules that are empirically believed to be related to the problems a user may encounter are set manually, and features describing the user's state before asking are obtained by matching against these rules. This has two problems: 1. It is not data-driven. It requires heavy manual intervention by someone who thoroughly understands the product or business, which becomes inconvenient when the product changes frequently or the business scope expands, so scalability is poor. 2. It fails to consider the relationship between the user's problem and the user's behavior shortly before seeking help from customer service. Usually, within a short time before seeking help (for example, within 2 hours), the user performs a series of actions, including but not limited to clicks in mobile phone and tablet clients, PC web browsing, and other operations performed by the user. These actions contain the user's behavior trajectory before asking the question, and in theory this trajectory is strongly correlated with the user's subsequent request for help.

The object of the present invention is to provide a data-driven method and device for predicting user problems which, given some known user information and operations, predict the problem the user has encountered as accurately as possible before the user describes it. This avoids the manual intervention required by the prior art and improves the accuracy of classification prediction.

To achieve the above object, the technical solution of the present invention is as follows. A data-driven method for predicting user problems comprises: when a question raised by a user is received, collecting user behavior data and preprocessing it; intercepting, from the preprocessed user behavior data, candidate behavior data that contribute to the question raised by the user; and filtering the candidate behavior data against a preset target behavior data set to retain the candidate behavior data contained in the target behavior data set, inputting the retained candidate behavior data into a trained classifier model, and predicting the category to which the user's question belongs.

Further, the training process of the trained classifier model comprises the following steps: collecting questions reported by users together with their corresponding behavior data, and preprocessing the collected user behavior data; intercepting, from the preprocessed user behavior data, the behavior data that contribute to each reported question as candidate behavior data; based on all reported questions and their corresponding candidate behavior data, scoring the candidate behavior data of each reported question with a data-driven method, selecting the target behavior data that satisfy a set condition, and taking the union of the target behavior data of all reported questions to form the target behavior data set; and training a classifier model from each reported question and the target behavior data set.

Further, the preprocessing comprises: removing interfering behavior data whose frequency is below a set frequency threshold.

The preprocessing further comprises: assigning numeric identifiers to the user behavior data. Numeric identifiers allow subsequent steps to operate directly on the identifiers rather than on long strings such as URLs or API names, which simplifies processing.

Further, the candidate behavior data that contribute to the question raised by the user are intercepted from the preprocessed user behavior data by windowed truncation, which comprises: intercepting the user behavior data from the most recent period before the problem occurred.

Further, after the union of the target behavior data of all reported questions is taken to form the target behavior data set, the method further comprises: reassigning numeric identifiers to the target behavior data in the target behavior data set.

Further, before the classifier model is trained, the method further comprises: vectorizing the target behavior data in the target behavior data set.

Further, before the retained candidate behavior data are input into the trained classifier model, the method further comprises: vectorizing the candidate behavior data. Vectorized user behavior data can be used directly to train the classifier model and to make predictions, which simplifies computation.

The present invention also provides a data-driven device for predicting user problems, comprising: a preprocessing module, configured to collect user behavior data and preprocess it when a question raised by a user is received; an interception module, configured to intercept, from the preprocessed user behavior data, candidate behavior data that contribute to the question raised by the user; and a prediction module, configured to filter the candidate behavior data against a preset target behavior data set to retain the candidate behavior data contained in the target behavior data set, input the retained candidate behavior data into a trained classifier model, and predict the category to which the user's question belongs.

Further, the device also comprises a model training module for training the classifier model. When training the classifier model, the model training module performs the following operations: collecting questions reported by users together with their corresponding behavior data, and preprocessing the collected user behavior data; intercepting, from the preprocessed user behavior data, the behavior data that contribute to each reported question as candidate behavior data; based on all reported questions and their corresponding candidate behavior data, scoring the candidate behavior data of each reported question with a data-driven method, selecting the target behavior data that satisfy a set condition, and taking the union of the target behavior data of all reported questions to form the target behavior data set; and training a classifier model from each reported question and the target behavior data set.

When preprocessing the collected user behavior data, the preprocessing module performs the following step: removing interfering behavior data whose frequency is below a set frequency threshold.

The preprocessing module is also configured to assign numeric identifiers to the user behavior data.

Further, when intercepting from the preprocessed user behavior data the candidate behavior data that contribute to the question raised by the user, the interception module uses windowed truncation, which comprises: intercepting the user behavior data from the most recent period before the problem occurred.

Further, after taking the union of the target behavior data of all reported questions to form the target behavior data set, the model training module is also configured to reassign numeric identifiers to the target behavior data in the target behavior data set.

Further, before training the classifier model, the model training module is also configured to vectorize the target behavior data in the target behavior data set.

Further, before inputting the retained candidate behavior data into the trained classifier model, the prediction module is also configured to vectorize the candidate behavior data.

The data-driven method and device for predicting user problems proposed by the present invention use the user's short-term behavior trajectory to classify and predict user problems, which improves classification accuracy and significantly outperforms models that do not include this information.

F1: Process

F2: Process

F3: Process

F4: Process

S1: Step

S2: Step

S3: Step

FIG. 1 is a flowchart of training the classifier model according to the present invention; FIG. 2 is a flowchart of the data-driven method for predicting user problems according to the present invention; FIG. 3 is a schematic structural diagram of the data-driven device for predicting user problems according to the present invention.

The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments. The following embodiments do not limit the present invention.

The general idea of the present invention is to train a classifier model from training data and then use the trained classifier model to analyze user behavior data in order to predict the problem the user has encountered.

As shown in FIG. 1, the process of training the classifier model from training data in this embodiment is as follows:

F1. Collect the questions reported by users together with their corresponding behavior data, and preprocess the collected user behavior data. The preprocessing includes removing interfering behavior data and assigning numeric identifiers to the behavior data.

For every question reported by a user, that user's behavior data are collected, yielding a large volume of behavior data. Behavior data are user operations, including clicks in mobile phone and tablet clients, PC web browsing, and other operations performed by the user. Each operation is represented by a URL or API name prefixed with a Unix timestamp. For example, the behavior of a user X over a past period can be represented as:

1438661879:alipay.mappprod.shop.queryPage
1438661885:alipay.client.mobileapp.checkResult
1438661889:alipay.commerce.category.queryByCategoryId
1438661899:alipay.siteprobe.sync.queryWifis
1438661909:alipay.charity.mobile.donate.deduct.unsign
.....
1438661999:https://couriercore.alipay.com/errorRepeatSubmit.htm
1438662999:https://cshall.alipay.com/lab/question.htm

For greater accuracy and to simplify subsequent processing, the preprocessing in this embodiment includes removing interfering behavior data and assigning numeric identifiers to the behavior data.

Interfering behavior data are behavior data that occur with extremely low frequency, for example below a set frequency threshold. Such data are unlikely to be the cause of the reported question, so this embodiment disregards them, thereby eliminating the interference they would introduce.

Numeric identifiers are assigned so that subsequent steps can operate directly on the identifiers rather than on the underlying long strings such as URLs or API names, which simplifies processing.

Numeric identifiers can be assigned in several ways: the URL or API name of each behavior record can be mapped through a mapping table prepared in advance; the occurrence frequency of each behavior can be counted, the behaviors sorted by frequency and numbered in that order, with the resulting number serving as the identifier; or the identifier can be computed as a hash of the behavior's content. After numeric identifiers are assigned, the behavior data above become:

1438661879:2
1438661885:65
1438661889:11
1438661899:6
1438661909:18
.....
1438661999:108
1438662999:111

Subsequent steps filter and process the data directly by these numeric identifiers.
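The preprocessing in F1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses the frequency-ranking option for numeric identifiers, and the function name `preprocess`, the parameter `min_count`, the descending sort order, and the tie-breaking by name are all assumptions made for the example.

```python
from collections import Counter

def preprocess(records, min_count=2):
    """Drop low-frequency behaviors and replace names with numeric identifiers.

    records: list of (unix_timestamp, behavior_name) tuples.
    min_count: hypothetical frequency threshold; behaviors seen fewer
               times than this are treated as interfering data.
    Returns (encoded_records, name_to_id).
    """
    counts = Counter(name for _, name in records)

    # Remove interfering behavior data below the frequency threshold.
    kept = [(ts, name) for ts, name in records if counts[name] >= min_count]

    # Number the remaining behaviors by descending frequency (assumed order),
    # breaking ties alphabetically so the mapping is deterministic.
    ranked = sorted({name for _, name in kept}, key=lambda n: (-counts[n], n))
    name_to_id = {name: i for i, name in enumerate(ranked, start=1)}

    encoded = [(ts, name_to_id[name]) for ts, name in kept]
    return encoded, name_to_id
```

A mapping table or a hash of the behavior string, the other two options named in the text, would replace only the `name_to_id` construction.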

F2. From the preprocessed user behavior data, intercept the candidate behavior data that contribute to the question reported by the user.

Among a large volume of user behavior data, what actually bears on the reported question is usually the user's behavior in the most recent period before the problem occurred. In other words, the behavior data that contribute to the reported question are the user's recent behavior data, and the influence of older historical behavior data can be ignored. This embodiment therefore truncates the user behavior data and selects the user's most recent behavior data as the candidate behavior data.

Specifically, interception is performed by windowing, with either a fixed or a variable window length. A fixed window length is, for example, 30-120 behavior records, meaning that 30-120 records are selected going backward from the current record. A variable window length selects the records within a certain duration going backward from the current record, for example the records within the 0.5 to 2 hours before the current time.

For example, for the behavior data above, windowed truncation starts from the last record, 1438662999:111, and goes backward by either a fixed window length (30-120 records) or a variable window length (0.5 to 2 hours, determined from the Unix timestamps). Suppose the data after windowed truncation become:

1438661885:65
1438661889:11
1438661899:6
1438661909:18
.....
1438661999:108
1438662999:111

This yields the candidate behavior data that contribute to the reported question. Traversing the behavior data of every reported question in this way gives the candidate behavior data for each reported question.
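The two windowing options in F2 can be sketched as below. This is an illustrative sketch only; the function names are hypothetical, and the records are assumed to be `(unix_timestamp, numeric_id)` tuples sorted by ascending timestamp with the last record being the one closest to the question.

```python
def fixed_window(records, n):
    """Fixed window length: keep the last n records (e.g. n between 30 and 120)."""
    return records[-n:] if n > 0 else []

def time_window(records, seconds):
    """Variable window length: keep records within `seconds` before the last record
    (e.g. 1800 to 7200 seconds, i.e. 0.5 to 2 hours)."""
    if not records:
        return []
    cutoff = records[-1][0] - seconds
    return [r for r in records if r[0] >= cutoff]
```

With a fixed window the number of records is constant and the time span varies; with a time window the span is constant and the number of records varies.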

F3. Based on all reported questions and their corresponding candidate behavior data, score the candidate behavior data of each reported question with a data-driven method, select the target behavior data that satisfy a set condition, and take the union of the target behavior data of all reported questions to form the target behavior data set.

In this embodiment, the questions reported by all users form the document set, and each reported question is one document. The data-driven method used is TF-IDF, a statistical weighting technique commonly used in information retrieval and data mining to evaluate how important a term is to one document in a document set or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus. In this embodiment, a term corresponds to a candidate behavior record, the set of all reported questions is the document set, each reported question is one document, and TF-IDF is used to score the candidate behavior data of each reported question.

The main idea of TF-IDF is that if a term or phrase has a high term frequency (TF) in one document and rarely appears in other documents, it discriminates well between categories and is suitable for classification. TF-IDF is simply TF × IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which term t appears in document d. The main idea of IDF is that the fewer documents contain term t, that is, the smaller n is, the larger the IDF, indicating that term t discriminates well between categories. If the number of documents in a class C that contain term t is m, and the total number of documents in other classes that contain t is k, then the number of documents containing t is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which indicates that term t does not discriminate strongly between categories.

The detailed algorithm is as follows:

Step 1: compute the term frequency (TF).

Step 2: compute the inverse document frequency (IDF).

Step 3: compute TF-IDF.

TF-IDF = term frequency (TF) × inverse document frequency (IDF)
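The three steps above are stated without explicit formulas for TF and IDF. A common textbook form of TF-IDF is given here as an assumption; it may differ in detail from the form intended in this embodiment (some variants, for instance, add 1 to the IDF denominator). Here $n_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$ (the quantity called n in the text).

```latex
\[
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{TF-IDF}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)
\]
```

In this embodiment's mapping, $t$ is a behavior identifier and $d$ is one reported question.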

To some extent, user behavior data can be regarded as terms, and TF-IDF measures the importance of a term to one document in a document set or corpus. This embodiment borrows the TF-IDF technique and applies it to screening behavior data, selecting the behavior data that are suitable for classification; the selected data are called target behavior data. For each reported question, this embodiment takes as target behavior data either the top N (50-200) highest-scoring behavior data or those whose score exceeds a certain threshold. The union of the target behavior data of all reported questions forms the target behavior data set, which contains far fewer behavior data than the full training data.

For example, the behavior data corresponding to question A (expressed as numeric identifiers) are: 6 18 1 9 77 98 69 ..... 88 189 87

All numeric identifiers are scored with TF-IDF, and the top N (for example 50) most important for question A are taken in descending order of score, giving the behavior data most important to question A: A: 11 18 ..... 108 .....

The union of the numeric identifier sets of all questions constitutes the target behavior data set. Once the questions reported by users have been identified as known questions, the target behavior data set therefore contains the target behavior data of all known questions.

Further, the target behavior data in the target behavior data set are assigned new numeric identifiers, which makes the set more compact and simplifies subsequent processing.
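The scoring, top-N selection, union, and renumbering in F3 can be sketched as follows. This is an illustration under assumptions: it uses the plain `tf × log(N / n_t)` form of TF-IDF, treats each reported question as one document keyed by a label, breaks score ties by identifier, and uses hypothetical function names.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """docs: dict mapping a reported question to its list of behavior identifiers.
    Returns {question: {behavior_id: tf-idf score}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for ids in docs.values():
        doc_freq.update(set(ids))  # count each behavior once per document

    scores = {}
    for question, ids in docs.items():
        counts = Counter(ids)
        total = len(ids)
        scores[question] = {
            b: (c / total) * math.log(n_docs / doc_freq[b])
            for b, c in counts.items()
        }
    return scores

def build_target_set(docs, top_n):
    """Take the top_n behaviors per question, union them, and renumber from 0."""
    selected = set()
    for per_question in tfidf_scores(docs).values():
        best = sorted(per_question, key=lambda b: (-per_question[b], b))[:top_n]
        selected.update(best)
    # New compact identifiers double as vector positions in later steps.
    new_index = {b: i for i, b in enumerate(sorted(selected))}
    return selected, new_index
```

A behavior that appears in every question receives an IDF of zero under this form, so it is selected only when a question has fewer than `top_n` more distinctive behaviors.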

F4. Train a classifier model from each reported question and the target behavior data set.

A classifier model is trained from the known questions and their corresponding target behavior data. When a user reports a question, the model analyzes the user's behavior data to predict which known question it is likely to be, making it easier for customer service to answer the question and offer a solution.

Classifier models include, but are not limited to, logistic regression, deep neural network, support vector machine, and recurrent neural network models. Since the prior art offers many methods for training a model from training data, they are not described further here.

As shown in FIG. 2, the data-driven method for predicting user problems in this embodiment comprises:

Step S1. When a question raised by a user is received, collect user behavior data and preprocess it.

Step S2. From the preprocessed user behavior data, intercept the behavior data that contribute to the question raised by the user as candidate behavior data.

After customer service receives the question raised by the user, the user's behavior data can be retrieved and preprocessed. The preprocessing and the windowed truncation were described above for training the classifier model and are not repeated here.

Step S3. Filter the candidate behavior data against the preset target behavior data set to retain the candidate behavior data contained in the target behavior data set, input the retained candidate behavior data into the trained classifier model, and predict the category to which the user's question belongs.

For example, after filtering against the target behavior data set, the behavior data of user X become:

1438661889:11
1438661909:18
.....
1438661999:108

Assuming the target behavior data set does not contain the behaviors in 1438661885:65, 1438661899:6, and 1438662999:111 (identifiers 65, 6, and 111), these three records are removed.
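The filtering in step S3 amounts to a set-membership test. A minimal sketch, with a hypothetical function name and records assumed to be `(unix_timestamp, numeric_id)` tuples:

```python
def filter_by_target_set(records, target_ids):
    """Keep only the records whose behavior identifier is in the target set."""
    return [(ts, b) for ts, b in records if b in target_ids]
```

Passing `target_ids` as a Python `set` keeps each membership check at constant average cost, which matters when the filter runs on every incoming question.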

Because the target behavior data set has already been obtained through screening and the classifier model has already been trained, when a user submits a question to customer service, the user's candidate behavior data can be submitted to the trained classifier model. The model outputs a probability for each known question, and the question with the highest probability is selected as the category to which the user's question belongs.
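The final selection step can be illustrated with a multi-class logistic regression (softmax) classifier, one of the model types this embodiment allows. This is a sketch only: it assumes the behavior data have already been turned into a numeric feature vector, and the weights, biases, and function names are hypothetical stand-ins for a trained model.

```python
import math

def softmax(scores):
    """Convert per-category scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - peak) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def predict_category(vec, weights, biases):
    """vec: feature vector; weights: {category: list of floats}; biases: {category: float}.
    Returns (most probable category, probabilities for all categories)."""
    scores = {
        c: sum(w * x for w, x in zip(ws, vec)) + biases[c]
        for c, ws in weights.items()
    }
    probs = softmax(scores)
    return max(probs, key=probs.get), probs
```

The full probability dictionary can also be used to show customer service staff the few most likely questions rather than only the top one.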

Further, to facilitate training the classifier model and subsequent prediction, the data-driven method of this embodiment also vectorizes the target behavior data in the target behavior data set and the candidate behavior data.

Vectorization is either binary or count-based. Binary vectorization sets the corresponding vector position to 1 if the behavior occurs and 0 if it does not; count-based vectorization stores the number of times the behavior occurs at the corresponding position. The vectorized user behavior data can be used directly to train the classifier model and make predictions, or combined with the original features for training and prediction.
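Both vectorization modes can be sketched in a few lines. The function name and the `index` argument (a mapping from behavior identifier to vector position, such as one produced when the target behavior data set is renumbered) are assumptions made for this example.

```python
from collections import Counter

def vectorize(behavior_ids, index, binary=True):
    """Turn a list of behavior identifiers into a fixed-length vector.

    index:  {behavior_id: position}, one position per target behavior.
    binary: True  -> 1 if the behavior occurs, else 0
            False -> number of times the behavior occurs
    """
    vec = [0] * len(index)
    for b, count in Counter(behavior_ids).items():
        if b in index:  # behaviors outside the target set are ignored
            vec[index[b]] = 1 if binary else count
    return vec
```

For instance, with `index = {11: 0, 18: 1, 108: 2}`, the sequence `[11, 18, 11, 108]` becomes `[1, 1, 1]` in binary mode and `[2, 1, 1]` in count mode.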

如圖3所示，與上述方法對應地，本發明還提出了一種基於資料驅動預測使用者問題的裝置，該裝置包括：預處理模組，用於當收到使用者提出的問題時，採集使用者行為資料並進行預處理；截取模組，用於從預處理後的使用者行為資料中截取對使用者提出的問題有貢獻的待選行為資料；預測模組，用於通過設定的目標行為資料集合對待選行為資料進行篩選，從待選行為資料中篩選出目標行為資料集合包含的待選行為資料，將篩選出的待選行為資料輸入訓練好的分類器模型，預測出使用者提出的問題所屬的類別。 As shown in FIG. 3 , corresponding to the above method, the present invention also proposes a device for predicting user problems based on data driving. The device includes: a preprocessing module, which is used for collecting data when receiving a problem raised by a user. User behavior data and preprocessing; interception module, used to intercept candidate behavior data that contributes to the questions raised by users from the preprocessed user behavior data; prediction module, used to pass the set goal The behavior data set filters the behavior data to be selected, and selects the behavior data contained in the target behavior data set from the behavior data to be selected, and inputs the selected behavior data into the trained classifier model, and predicts that the user proposes The category the question belongs to.

The device for predicting user problems of this embodiment further includes a model training module for training the classifier model. When training the classifier model, the model training module performs the following operations: collecting questions fed back by users and their corresponding behavior data, and preprocessing the collected user behavior data; intercepting, from the preprocessed user behavior data, the behavior data that contributes to the questions fed back by users as candidate behavior data; based on all questions fed back by users and their corresponding candidate behavior data, scoring the candidate behavior data corresponding to each question using a data-driven method, screening out the target behavior data that meets the set condition, and taking the union of the target behavior data corresponding to all questions to form the screened target behavior data set; and training the classifier model based on each question fed back by users and the target behavior data set.
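
The screening-and-union step can be sketched as follows. This is only an illustration: the scores are taken as given, and the "set condition" is assumed here to be a simple minimum score. The names `build_target_set`, `scores_by_question`, and `min_score` are hypothetical.

```python
from typing import Dict, Hashable, Set


def build_target_set(scores_by_question: Dict[str, Dict[Hashable, float]],
                     min_score: float) -> Set[Hashable]:
    """Union of the behaviors that pass the screening condition per question.

    scores_by_question maps each question category to the scores of its
    candidate behaviors (how the scores are computed is outside this sketch).
    A behavior is kept for a question if its score is at least min_score,
    which is an assumed form of the set condition.
    """
    target: Set[Hashable] = set()
    for scores in scores_by_question.values():
        # Keep the behaviors that qualify for this question, then merge.
        target |= {b for b, s in scores.items() if s >= min_score}
    return target
```

Because the result is a union, a behavior that qualifies for any one question stays in the target behavior data set even if it scores poorly for every other question.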

When preprocessing the collected user behavior data, the preprocessing module of this embodiment performs the following step: removing interfering behavior data whose frequency is lower than the set frequency threshold.
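
A minimal sketch of this filtering step is given below. It assumes that frequency means the number of occurrences of a behavior within the collected data and that the threshold is an occurrence count; the names `remove_low_frequency` and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def remove_low_frequency(behaviors: Sequence[Hashable],
                         min_count: int) -> List[Hashable]:
    """Drop every behavior that occurs fewer than min_count times.

    The order of the remaining behaviors is preserved, so later
    time-based steps can still rely on it.
    """
    counts = Counter(behaviors)
    return [b for b in behaviors if counts[b] >= min_count]
```

With `min_count=2`, the sequence `["a", "b", "a", "c", "a", "b"]` becomes `["a", "b", "a", "a", "b"]`: the single occurrence of `"c"` is treated as interference and removed.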

Further, the preprocessing module is also configured to digitally identify the user behavior data.

When intercepting, from the preprocessed user behavior data, the candidate behavior data that contributes to the question raised by the user, the interception module of this embodiment uses windowed truncation. The windowed truncation includes: intercepting the user behavior data within the most recent period before the problem occurred.
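
Windowed truncation can be sketched as a simple time filter. The representation of each behavior as a `(timestamp, behavior)` pair, the use of seconds, and the names `window_truncate`, `question_time`, and `window_seconds` are assumptions for illustration; the length of the window is a tunable setting that this passage does not fix.

```python
from typing import Hashable, List, Sequence, Tuple

Event = Tuple[float, Hashable]  # (timestamp in seconds, behavior identifier)


def window_truncate(events: Sequence[Event],
                    question_time: float,
                    window_seconds: float) -> List[Event]:
    """Keep only the events inside the window that ends at question_time.

    Events older than question_time - window_seconds, and events that
    happen after the question was raised, are discarded.
    """
    start = question_time - window_seconds
    return [(t, b) for (t, b) in events if start <= t <= question_time]
```

For instance, with a question raised at `t = 100` and a 60-second window, events at `t = 50` and `t = 95` are kept, while events at `t = 0` and `t = 120` are dropped.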

Corresponding to the above method, after taking the union of the target behavior data corresponding to all questions fed back by users to form the screened target behavior data set, the model training module of this embodiment is also configured to digitally identify the target behavior data in the target behavior data set again.
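
One plausible reading of this re-identification step, offered as an assumption rather than as the method of the invention, is that the behaviors remaining after screening are renumbered with consecutive integers so that each one maps to a dense vector position. The name `reindex` is hypothetical.

```python
from typing import Dict, Hashable, Iterable


def reindex(target_behaviors: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Assign consecutive identifiers 0..n-1 to the target behaviors.

    Sorting makes the assignment deterministic; it assumes the original
    identifiers are mutually comparable (for example, all integers).
    """
    return {b: i for i, b in enumerate(sorted(target_behaviors))}
```

If screening leaves the sparse identifiers `{4, 17, 256}`, renumbering yields `{4: 0, 17: 1, 256: 2}`, so a three-element vector is enough to represent them.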

Before training the classifier model, the model training module of this embodiment is also configured to vectorize the target behavior data in the target behavior data set.

Corresponding to the above method, before inputting the selected candidate behavior data into the trained classifier model, the prediction module of this embodiment is also configured to vectorize the candidate behavior data.
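
The prediction-side flow (screen against the target set, vectorize, classify) can be sketched end to end. The classifier is represented here as any callable that maps a vector to a category label, because this passage does not name a specific model; `predict_category` and its parameters are hypothetical names.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def predict_category(candidates: Sequence[Hashable],
                     target_ids: Sequence[Hashable],
                     classifier: Callable[[List[int]], str],
                     binary: bool = True) -> str:
    """Screen, vectorize, and classify one user's candidate behaviors.

    candidates: candidate behavior identifiers for the current question.
    target_ids: ordered identifiers of the target behavior data set.
    classifier: stand-in for the trained classifier model (assumption).
    """
    target = set(target_ids)
    # Screening: keep only candidates contained in the target set.
    kept = [b for b in candidates if b in target]
    counts = Counter(kept)
    if binary:
        vector = [1 if counts[b] else 0 for b in target_ids]
    else:
        vector = [counts[b] for b in target_ids]
    return classifier(vector)
```

In practice the callable would wrap the trained model's prediction function, and the same `target_ids` ordering must be used at training time and at prediction time so that the vector positions agree.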

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Without departing from the spirit and essence of the present invention, a person having ordinary skill in the art may make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the scope of protection of the claims appended to the present invention.

Claims

A data-driven method for predicting user problems, the method comprising: when a question raised by a user is received, collecting user behavior data and preprocessing it, wherein the preprocessing includes removing interfering behavior data whose frequency is lower than a set frequency threshold; intercepting, by windowed truncation, candidate behavior data that contributes to the question raised by the user from the preprocessed user behavior data, wherein the windowed truncation includes intercepting the user behavior data within the most recent period before the problem occurred; and screening the candidate behavior data against a set target behavior data set, selecting from the candidate behavior data the candidate behavior data contained in the target behavior data set, inputting the selected candidate behavior data into a classifier model, and predicting the category to which the question raised by the user belongs.

The data-driven method for predicting user problems according to claim 1, wherein the training process of the classifier model includes the following steps: collecting questions fed back by users and their corresponding behavior data; performing the preprocessing on the collected user behavior data; intercepting, from the preprocessed user behavior data, the behavior data that contributes to the questions fed back by users as candidate behavior data; based on all questions fed back by users and their corresponding candidate behavior data, scoring the candidate behavior data corresponding to each question using a data-driven method, screening out the target behavior data that meets a set condition, and taking the union of the target behavior data corresponding to all questions to form the screened target behavior data set; and training the classifier model based on each question fed back by users and the target behavior data set.

The data-driven method for predicting user problems according to claim 1, wherein the preprocessing further includes: digitally identifying the user behavior data.

The data-driven method for predicting user problems according to claim 3, wherein, after the union of the target behavior data corresponding to all questions fed back by users is taken to form the screened target behavior data set, the method further includes: digitally identifying the target behavior data in the target behavior data set again.

The data-driven method for predicting user problems according to claim 2, wherein, before the classifier model is obtained by training, the method further includes: vectorizing the target behavior data in the target behavior data set.

The data-driven method for predicting user problems according to claim 5, wherein, before the selected candidate behavior data is input into the classifier model, the method further includes: vectorizing the candidate behavior data.

A data-driven device for predicting user problems, the device comprising: a preprocessing module, configured to collect user behavior data and preprocess it when a question raised by a user is received, wherein, when preprocessing the collected user behavior data, the preprocessing module removes interfering behavior data whose frequency is lower than a set frequency threshold; an interception module, configured to intercept, by windowed truncation, candidate behavior data that contributes to the question raised by the user from the preprocessed user behavior data, wherein the windowed truncation includes intercepting the user behavior data within the most recent period before the problem occurred; and a prediction module, configured to screen the candidate behavior data against a set target behavior data set, select from the candidate behavior data the candidate behavior data contained in the target behavior data set, input the selected candidate behavior data into a classifier model, and predict the category to which the question raised by the user belongs.

The data-driven device for predicting user problems according to claim 7, wherein the device further includes a model training module for training the classifier model, and, when training the classifier model, the model training module performs the following operations: collecting questions fed back by users and their corresponding behavior data, and performing the preprocessing on the collected user behavior data; intercepting, from the preprocessed user behavior data, the behavior data that contributes to the questions fed back by users as candidate behavior data; based on all questions fed back by users and their corresponding candidate behavior data, scoring the candidate behavior data corresponding to each question using a data-driven method, screening out the target behavior data that meets a set condition, and taking the union of the target behavior data corresponding to all questions to form the screened target behavior data set; and training the classifier model based on each question fed back by users and the target behavior data set.

The data-driven device for predicting user problems according to claim 7, wherein the preprocessing module is further configured to digitally identify the user behavior data.

The data-driven device for predicting user problems according to claim 9, wherein, after taking the union of the target behavior data corresponding to all questions fed back by users to form the screened target behavior data set, the model training module is further configured to digitally identify the target behavior data in the target behavior data set.

The data-driven device for predicting user problems according to claim 8, wherein the model training module is further configured to vectorize the target behavior data in the target behavior data set before training to obtain the classifier model.

The data-driven device for predicting user problems according to claim 11, wherein the prediction module is further configured to vectorize the candidate behavior data before inputting the selected candidate behavior data into the classifier model.