CN107590224A - User preference analysis method and device based on big data

Info

Publication number: CN107590224A
Application number: CN201710786530.7A
Authority: CN
Inventors: 王颖帅; 李晓霞; 苗诗雨
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-09-04
Filing date: 2017-09-04
Publication date: 2018-01-16
Anticipated expiration: 2037-09-04
Also published as: CN107590224B

Abstract

The disclosure provides a user preference analysis method and device based on big data. The method includes: obtaining interaction behavior data between a user and content, the content having at least one label; preprocessing the interaction behavior data to generate a feature data set, and using the feature data set as the input feature values of a gcForest model; using the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level; and obtaining the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model. The user preference analysis method provided by the disclosure can produce more accurate user preference analysis results from big data samples.

Description

User preference analysis method and device based on big data

Technical field

The present disclosure relates to the technical field of machine learning, and in particular to a user preference analysis method and device based on big data.

Background

With the development of Internet technology, personalized content recommendation for users has become increasingly common. Taking article recommendation as an example, one or more labels are set for each article according to its content, and the user's operations on articles are collected; it is then possible to determine which labels the user prefers, so that other articles under those labels can be recommended to the user, improving the user experience.

In existing personalized recommendation technology, the main methods for analyzing user preference are analysis based on the logistic regression (LR) algorithm and statistical scoring formulas in which each feature is weighted by time according to an analyst's strategy. In the LR-based analysis, a data analyst must decide from business experience which features to extract and how to label the content. After the feature and label data are obtained, stratified sampling is performed over the different labels, and the logistic regression model of a statistical analysis package is used to obtain the coefficient of each feature, which determines the scoring formula for the user's label preference. Time-weighted statistical scoring assumes that a user prefers content selected recently over content selected longer ago, so a table of time weights is maintained: a suitable function is chosen to determine the time weight of each of the 365 days of the year, and the features are then combined into a statistical formula with a time dimension.

In the above techniques, the LR-based analysis requires the analyst to determine the coefficient of each feature from business experience, so it depends heavily on the analyst's experience; each business line requires manual analysis, efficiency is low, and the sample size is small. Moreover, because a user's preference for content differs from one period to another, it is difficult to find the most suitable time weight function, so time-weighted statistical scoring also struggles to capture user preference accurately.

Therefore, a user preference analysis method that can process a large number of samples and provide more accurate analysis results is of great significance for improving personalized recommendation capability and increasing content click volume.

It should be noted that the information disclosed in the Background section above is only intended to enhance understanding of the background of the disclosure, and it may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.

Summary of the invention

The purpose of the disclosure is to provide a user preference analysis method and device based on big data that overcome, at least to some extent, one or more problems caused by the limitations and defects of the related art.

According to a first aspect of the embodiments of the disclosure, a user preference analysis method based on big data is provided, including: obtaining interaction behavior data between a user and content, the content having at least one label; preprocessing the interaction behavior data to generate a feature data set, and using the feature data set as the input feature values of a gcForest model; using the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level; and obtaining the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.

In an exemplary embodiment of the disclosure, the interaction behavior data include data on the user's operations on the content within a preset time period, the data including the number of views, likes, shares, comments, detail views, and orders placed.

In an exemplary embodiment of the disclosure, preprocessing the interaction behavior data includes: determining whether the interaction behavior data contain missing data and, if so, filling in the missing data; deleting the maximum and minimum values within a preset range of the interaction behavior data; and performing feature normalization on the interaction behavior data.

In an exemplary embodiment of the disclosure, preprocessing the interaction behavior data further includes: adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.

In an exemplary embodiment of the disclosure, the method further includes: obtaining the user's physical-goods category preference data; and correcting the user's preference probability for the label according to the physical-goods category preference data.

In an exemplary embodiment of the disclosure, the method further includes: selecting recommended content according to the preference probability; and obtaining the user's click data on the recommended content and correcting the preference probability according to the click data.

According to a second aspect of the disclosure, a user preference analysis device based on big data is provided, including: a data acquisition module for obtaining interaction behavior data between a user and content, the content having at least one label; a feature preprocessing module for preprocessing the interaction behavior data to generate a feature data set and using the feature data set as the input feature values of a gcForest model; a cascade forest module for using the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level; and a preference calculation module for obtaining the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.

In an exemplary embodiment of the disclosure, the feature preprocessing module includes: a missing value processing unit for determining whether the interaction behavior data contain missing data and, if so, filling in the missing data; an outlier processing unit for deleting the maximum and minimum values within a preset range of the interaction behavior data; and a normalization unit for performing feature normalization on the interaction behavior data.

In an exemplary embodiment of the disclosure, the feature preprocessing module further includes: a feature adding unit for adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.

In an exemplary embodiment of the disclosure, the device further includes: a physical-goods preference correction module for obtaining the user's physical-goods category preference data and correcting the user's preference probability for the label according to the physical-goods category preference data.

In an exemplary embodiment of the disclosure, the device further includes: a click-through rate correction module for selecting recommended content according to the preference probability, obtaining the user's click data on the recommended content, and correcting the preference probability according to the click data.

According to a third aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the method steps of any one of the above.

By using an improved multi-grained cascade forest algorithm (gcForest) to process big data samples in a distributed manner and analyzing the user's preference for content labels from the output, the present invention can obtain more accurate user preference analysis results using richer data, improving personalized recommendation efficiency and the user experience.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the disclosure.

Brief description of the drawings

The accompanying drawings, which are incorporated in and form part of this specification, show embodiments consistent with the disclosure and, together with the specification, serve to explain its principles. Evidently, the drawings described below are only some embodiments of the disclosure; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 schematically shows a flowchart of the big-data-based user preference analysis method in an exemplary embodiment of the disclosure.

Fig. 2 is a schematic diagram of interaction behavior data in an exemplary embodiment of the disclosure.

Fig. 3 is a flowchart of preprocessing the interaction behavior data in an exemplary embodiment of the disclosure.

Fig. 4 is a flowchart of handling missing values in the interaction behavior data in an exemplary embodiment of the disclosure.

Fig. 5 is a schematic diagram of the data table obtained after preprocessing the interaction behavior data in an exemplary embodiment of the disclosure.

Fig. 6 is a schematic diagram of the multi-grained cascade forest (gcForest) structure.

Fig. 7 is a schematic diagram of class probability vector generation in the cascade forest.

Fig. 8 is a schematic diagram of the improvement made to the gcForest algorithm in an exemplary embodiment of the disclosure.

Fig. 9 is a table of users' preference probabilities for labels, as output in an exemplary embodiment of the disclosure.

Fig. 10 is a flowchart of a user preference analysis method in an exemplary embodiment of the disclosure.

Fig. 11 schematically shows a block diagram of a big-data-based user preference analysis device in an exemplary embodiment of the disclosure.

Fig. 12 schematically shows a block diagram of another big-data-based user preference analysis device in an exemplary embodiment of the disclosure.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth here; rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the disclosure. Those skilled in the art will appreciate, however, that the technical solutions of the disclosure can be practiced with one or more of the specific details omitted, or with other methods, components, devices, steps, and so on. In other cases, well-known solutions are not shown or described in detail, so as not to distract from or obscure aspects of the disclosure.

In addition, the drawings are only schematic illustrations of the disclosure; identical reference numerals in the drawings denote identical or similar parts, so their repeated description is omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

Example embodiments of the disclosure are described in detail below with reference to the accompanying drawings.

Fig. 1 schematically shows a flowchart of the big-data-based user preference analysis method in an exemplary embodiment of the disclosure. Referring to Fig. 1, the big-data-based user preference analysis method 100 includes:

Step S102: obtain interaction behavior data between a user and content, the content having at least one label.

Step S104: preprocess the interaction behavior data to generate a feature data set, and use the feature data set as the input feature values of a gcForest model.

Step S106: use the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level.

Step S108: obtain the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.

By using an improved multi-grained cascade forest algorithm (gcForest) to process massive samples in a distributed manner and analyzing the user's preference for content labels from the output, more accurate user preference analysis results can be obtained using richer data, improving personalized recommendation efficiency and the user experience.

Each step of method 100 is described in detail below.

In step S102, interaction behavior data between a user and content are obtained, the content having at least one label.

The "content" referred to in the disclosure includes, but is not limited to, articles, commodities, music, videos, books, or any other content that can be recommended to users. For convenience of description, the disclosure uses article recommendation as its only example; those skilled in the art can apply the method to personalized recommendation of other content.

Fig. 2 is a schematic diagram of interaction behavior data in an exemplary embodiment of the disclosure. Referring to Fig. 2, the interaction behavior data can include data on the user's operations on the content within a preset time period, the data including the number of views, likes, shares, comments, detail views, and orders placed.

Specifically, in the JD.com (Jingdong) Discover article channel, about 13,000,000 users performed a direct action on article content within 90 days. This quantity meets the data analysis requirement, so the preset time period can be set to 90 days and the interaction behavior data can be the user behavior within those 90 days. In some embodiments, however, behaviors such as liking and sharing can be analyzed using the user's operation data from the past 30 days.

The following six features can be extracted with HIVE from the user behavior recorded in the database:

Feature 1: the user's browsing score for the label over the past 90 days;

Feature 2: the user's like score for the label over the past 90 days;

Feature 3: the user's share score for the label over the past 30 days;

Feature 4: the user's comment score for the label over the past 30 days;

Feature 5: the user's score for clicks on commodity details within the label page over the past 30 days;

Feature 6: the user's score for orders generated through the label.

HIVE is a Hadoop-based data warehouse tool that maps structured data files to database tables and provides simple SQL query capability, which makes it well suited to statistical analysis of data. Each score feature extracted above includes the count of a user behavior within the preset time period. When processing the score features, a feature value can be set for each data item through data preprocessing, according to the specific business of each feature dimension. In some embodiments, feature values can also be set according to a specific weight for each user behavior; the feature value can be, for example, the count of the user behavior or a weighted value of that count. The method of setting feature values can be chosen by those skilled in the art according to actual conditions, and the disclosure places no particular limitation on it.

The disclosure computes, for each of a large number of candidate features, how strongly it correlates with the target prediction variable (the user's click conversion rate), and selects the six most valuable features above according to information gain; these can cover the analysis of more than 90% of user behavior information.

As shown in Fig. 2, in the extracted feature table the first column is the user name, the second column is the content label, and the third through eighth columns are the values of the six features extracted by HIVE.
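As a non-limiting illustration, the table layout of Fig. 2 could be assembled roughly as follows. This is a minimal Python sketch rather than the HIVE query itself; the event tuple layout, the behavior names in `BEHAVIORS`, and the function name `build_feature_rows` are assumptions made for illustration, and raw counts stand in for the score features.

```python
from collections import defaultdict

# Hypothetical behavior names, one per extracted feature (columns 3 to 8).
BEHAVIORS = ["browse", "like", "share", "comment", "detail_click", "order"]

def build_feature_rows(events):
    """Count each behavior per (user, label) pair.

    `events` is an iterable of (user, label, behavior) tuples; this layout
    is an assumption for illustration, not the schema used in the disclosure.
    """
    counts = defaultdict(lambda: [0] * len(BEHAVIORS))
    for user, label, behavior in events:
        counts[(user, label)][BEHAVIORS.index(behavior)] += 1
    # Row layout follows Fig. 2: user, label, then the six feature values.
    return [[user, label, *vals] for (user, label), vals in sorted(counts.items())]

events = [
    ("u1", "digital", "browse"),
    ("u1", "digital", "browse"),
    ("u1", "digital", "like"),
    ("u2", "fashion", "order"),
]
rows = build_feature_rows(events)
```

Each resulting row corresponds to one user and label pair, which is the granularity at which the later preference probability is predicted.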

In step S104, the interaction behavior data are preprocessed to generate a feature data set, and the feature data set is used as the input feature values of the gcForest model.

Fig. 3 is a flowchart of preprocessing the interaction behavior data in an exemplary embodiment of the disclosure. Referring to Fig. 3, the preprocessing flow can include:

Step S302: determine whether the interaction behavior data contain missing data and, if so, fill in the missing data.

Step S304: delete the maximum and minimum values within a preset range of the interaction behavior data.

Step S306: perform feature normalization on the interaction behavior data.

Referring to Fig. 4, in step S302, after the data are obtained, step S3020 first determines whether missing values exist. If not, the flow proceeds to step S304; if so, it proceeds to step S3021, which determines whether the missing values are meaningful. "Meaningful" refers to whether the feature data play a key role in predicting the target. For example, if only one of a user's six feature values is missing, the user can be judged to be an active user, and the missing feature does not affect prediction of that user's behavior, so the missing value can be judged not meaningful. In some embodiments, the determination can be made by computing the ratio of missing values to existing values: when the ratio is below a threshold, the missing values are judged not meaningful; when the ratio is greater than or equal to the threshold, the missing values are considered important and meaningful.

When the missing values are meaningful and categorical, step S3022 creates a new category for them. When the missing values are meaningful and numeric, step S3023 sets them to a reasonable value, for example the mean or median of all values of the feature. When the missing values are not meaningful, step S3024 determines the specific pattern of missing data. If little data is missing overall (less than a threshold), step S3025 deletes the instances with missing data. If the overall data are ordered in time, step S3026 replaces the missing value with a value from data earlier in time than the overall data (for example, the data closest in time to the earliest time of the overall data). In other cases, step S3027 determines whether the overall data follow a simple distribution. If they do not, step S3028 uses a simple machine learning model to produce a substitute for the missing value, and the flow proceeds to step S304. If the overall data follow a simple distribution and contain no outliers, step S3029 replaces the missing value with the mean of the column, and the flow proceeds to step S304. If the overall data follow a simple distribution and contain outliers, step S3030 replaces the missing value with the median of the column, and the flow proceeds to step S304. An outlier is a value in a set of values whose deviation from the mean exceeds two standard deviations.
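As a non-limiting illustration, the last two branches of this flow (steps S3029 and S3030) can be sketched in Python as below. The sketch covers only the mean-or-median choice; the function names are hypothetical, missing entries are assumed to be represented as `None`, and the use of the population standard deviation is an assumption, since the text does not say which estimator is meant.

```python
import statistics

def has_outliers(values):
    """Outlier per the text: deviation from the mean exceeds two standard deviations."""
    if len(values) < 2:
        return False
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return any(abs(v - mean) > 2 * std for v in values)

def fill_missing(column):
    """Fill None entries with the column mean (S3029) or median (S3030)."""
    present = [v for v in column if v is not None]
    if has_outliers(present):
        fill = statistics.median(present)  # robust to the outliers
    else:
        fill = statistics.mean(present)
    return [fill if v is None else v for v in column]

clean_column = fill_missing([1.0, 2.0, None, 3.0])
skewed_column = fill_missing([1.0] * 9 + [100.0, None])
```

The median is used when outliers are present because a single extreme value, such as the 100.0 above, would otherwise pull the substitute far from the typical value of the column.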

In step S304, the data distribution of each feature is analyzed and the top x% and bottom y% of the data are deleted, which removes outliers from the data. Here x and y are natural numbers between 1 and 99 and may be equal or different. In the exemplary embodiment of the disclosure, x and y can both be 5, for example; that is, outliers are removed by deleting the largest 5% and the smallest 5% of the data.
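As a non-limiting illustration, the trimming in step S304 might look like the following Python sketch. The function name and parameter names are hypothetical, and rounding the cut-off counts down is an assumption; a production version would drop whole rows rather than return a sorted list of values.

```python
def trim_extremes(values, low_pct=5, high_pct=5):
    """Drop the smallest low_pct% and the largest high_pct% of the values."""
    ordered = sorted(values)
    n = len(ordered)
    lo = n * low_pct // 100        # number of smallest values to drop
    hi = n - n * high_pct // 100   # index just past the last value kept
    return ordered[lo:hi]

trimmed = trim_extremes(range(1, 101))
```

With 100 values and x = y = 5, the five smallest and five largest values are removed and 90 values remain.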

In step S306, the data of each feature can be normalized. The normalization formula can be, for example:

$$y_i^{new} = \frac{y_i - y_{min}}{y_{max} - y_{min}}$$

where $y_i^{new}$ is the normalized feature value, $y_i$ is the original value, $y_{min}$ is the minimum of all existing values of the feature, and $y_{max}$ is the maximum of all existing values of the feature. After normalization, the feature data are distributed between 0 and 1.
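As a non-limiting illustration, and assuming the normalization is the usual min-max scaling (which is consistent with the stated roles of the minimum and maximum and with the 0-to-1 output range), step S306 could be sketched in Python as follows. The function name and the handling of a constant column are assumptions.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    y_min, y_max = min(values), max(values)
    if y_max == y_min:
        # Assumption: a constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(y - y_min) / (y_max - y_min) for y in values]

normalized = min_max_normalize([2, 4, 6])
```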

In an exemplary embodiment of the disclosure, preprocessing the interaction behavior data can further include step S308: adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.

In addition to the user's behavior within the preset time period, it can be observed whether the user clicked, yesterday, on content carrying a label the user had previously operated on. If the user did click such content yesterday, the feature value "1" is added to the interaction behavior data of that user and label; if the user did not, the feature value "0" is added. The added feature values form a new feature column.
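As a non-limiting illustration, the additional column of step S308 can be computed as in the Python sketch below. The function name is hypothetical, and the click history for one user and label pair is assumed to be available as a set of calendar dates.

```python
from datetime import date, timedelta

def clicked_yesterday_flag(click_dates, today):
    """Return 1 if the label's content was clicked on the day before `today`, else 0."""
    yesterday = today - timedelta(days=1)
    return 1 if yesterday in click_dates else 0

flag_hit = clicked_yesterday_flag({date(2017, 9, 3)}, date(2017, 9, 4))
flag_miss = clicked_yesterday_flag({date(2017, 9, 1)}, date(2017, 9, 4))
```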

After preprocessing, the complete training data set can be constructed on the big data platform Spark. Fig. 5 is a schematic diagram of the data table obtained after preprocessing the interaction behavior data in an exemplary embodiment of the disclosure. Referring to Fig. 5, the first column of the data table is the added feature value, and each subsequent column is, in order, a "feature number:feature value" pair.
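As a non-limiting illustration, one row of the table in Fig. 5 could be serialized as in the Python sketch below. The space separator and the numbering of features from 1 are assumptions, since the text specifies only the column order and the "feature number:feature value" form; the function name is hypothetical.

```python
def format_row(added_flag, features):
    """Serialize one row: the added feature value first, then number:value pairs."""
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1)]
    return " ".join([str(added_flag)] + pairs)

line = format_row(1, [0.2, 0.0, 0.5])
```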

Preprocessing the acquired interaction behavior data provides a more effective and more accurate data source for the subsequent analysis.

Next, the preprocessed data are used as the input data of the machine learning model.

It is worth noting that the machine learning model must be trained on a data set before it is tested on data. The trained model can be used to test data sets including the training data set; in some embodiments of the disclosure, the test data set can include preprocessed current user behavior data obtained from an online data stream.

In an exemplary embodiment of the disclosure, the gcForest algorithm is chosen as the machine learning algorithm for analyzing user preference. gcForest (multi-grained cascade forest) is a multi-grained cascaded decision tree ensemble method. Whereas feature learning in deep neural networks relies on layer-by-layer processing of the original features, gcForest uses a cascade structure in which multiple forests composed of decision trees perform feature learning. Multi-grained scanning of the input can strengthen the feature learning ability of the cascade forest. Compared with the traditional logistic regression algorithm, gcForest can extract features more effectively, is better suited to precise personalized recommendation on big data and to parallel deployment, and has advantages such as simple theoretical analysis and few tuning parameters.

Fig. 6 is a schematic diagram of the multi-grained cascade forest (gcForest) structure. Referring to Fig. 6, each level of the cascade forest receives the feature information processed by the previous level and outputs its own result to the next level. Each cascade level includes two random forests and two completely random forests; each completely random forest contains 1000 completely random trees, and each random forest contains 1000 random trees. With the cascade structure, gcForest model training is divided into two stages: a feature generation stage and a result output stage. In the feature generation stage, each completely random tree in a completely random forest randomly selects one feature to split on at each node, and the tree grows until each leaf node contains only instances of the same class or no more than 10 instances. By contrast, each random tree in a random forest randomly selects √d features as candidates, where d is the number of features, and chooses the one with the best Gini value as the split feature. Assuming there are n classes to predict, each forest outputs an n-dimensional class probability vector, which is then concatenated, as an interaction combination feature, to form the input data of the next level of forests.
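As a non-limiting illustration of the split rule used by the random trees, the Python sketch below shows the Gini impurity of a set of class labels, the weighted impurity of a two-way split, and the random choice of √d candidate features. The function names are hypothetical, and rounding √d down to an integer is an assumption.

```python
import math
import random

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(left, right):
    """Size-weighted Gini impurity of a two-way split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

def candidate_features(n_features, rng):
    """Randomly pick floor(sqrt(d)) feature indices as split candidates."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

candidates = candidate_features(7, random.Random(0))
```

A random tree would evaluate `split_gini` for each candidate feature and keep the split with the lowest value, whereas a completely random tree skips the evaluation and splits on a single randomly chosen feature.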

Fig. 7 is a schematic diagram of class probability vector generation in the cascade forest. Referring to Fig. 7, different marks in the leaf nodes represent different classes. When a new user instance enters the gcForest model, each forest computes, for every tree, the percentage of training samples of each class at the leaf node into which the instance falls, and averages these percentages over all trees in the forest to produce an estimate of the class distribution; that is, each forest outputs one class probability vector. To reduce the risk of overfitting, the class probability vector produced by each forest is generated by K-fold cross-validation.
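As a non-limiting illustration, the averaging step that turns per-tree leaf statistics into one class probability vector can be sketched in Python as below. The input layout (one dictionary of class counts per tree, describing the leaf the instance falls into) and the function name are assumptions; the K-fold cross-validation is omitted.

```python
def forest_class_vector(leaf_counts_per_tree, classes):
    """Average per-tree leaf class proportions into one class probability vector."""
    totals = [0.0] * len(classes)
    for leaf_counts in leaf_counts_per_tree:
        n = sum(leaf_counts.values())  # training samples in this leaf
        for i, c in enumerate(classes):
            totals[i] += leaf_counts.get(c, 0) / n
    return [t / len(leaf_counts_per_tree) for t in totals]

# Two trees: the instance lands in a leaf with 3 A / 1 B, and one with 1 A / 1 B.
vector = forest_class_vector([{"A": 3, "B": 1}, {"A": 1, "B": 1}], ["A", "B"])
```

Because each tree contributes proportions that sum to 1, the averaged vector also sums to 1.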

In step S106, the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, is used as the input features of the next cascade forest level.

Fig. 8 is a schematic diagram of the improvement made to the gcForest algorithm in an exemplary embodiment of the disclosure. Referring to Fig. 8, instances extracted with sliding feature windows of the same size are used to train a completely random forest and a random forest; the trained forests generate class probability vectors, which are concatenated as the transformed features. Existing gcForest algorithms construct the new input features of the next cascade level by combining the output of the previous cascade level with the original features. The present invention improves the construction of the new input features: except for the first cascade level, whose input is the 7 preprocessed features, the input features of every other cascade level are the original 7 features, the interaction combination features output by the previous cascade level, and the preference probability feature of the leaf nodes output by the previous cascade level; that is, the preference probability predicted by the previous level is used as a new input feature of the next level. Specifically, the preference probability predicted by the previous level is captured by the sliding window as a new feature, together with the original 7 features and the interaction combination features output by the previous level, and the preference probability is added to the split features of the next level as a split feature. Adding features to each cascade level in this way can improve the accuracy of the gcForest algorithm.
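As a non-limiting illustration, the input of a cascade level after the first could be assembled as in the Python sketch below. The function name is hypothetical, the sliding-window scan is omitted, and treating the preference probability predicted by the previous level as a single scalar is an assumption, since the text does not fix its exact form.

```python
def next_level_input(original_features, class_vectors, preference_prob):
    """Concatenate original features, previous-level class vectors, and preference probability."""
    features = list(original_features)
    for vec in class_vectors:          # one class probability vector per forest
        features.extend(vec)
    features.append(preference_prob)   # assumption: a single scalar feature
    return features

# 7 preprocessed features, 4 forests with 2 classes each, 1 preference probability.
level_input = next_level_input([0.1] * 7, [[0.6, 0.4]] * 4, 0.8)
```

With 7 original features, four forests, and two classes, the next level receives 7 + 4 × 2 + 1 = 16 input features.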

Referring again to Fig. 7, the class probability vector output by the last cascade forest level of the gcForest model has the form {a, b, c, d, ...}. The number of elements in the vector equals the number of labels involved, the elements sum to 1, and each element is the user's preference probability for one label. By obtaining the class probability vectors of the instances of several users, the preference scores of those users for their preferred labels can be obtained. The calculation of the preference score can be set by those skilled in the art according to actual conditions, as long as it is based on the user's preference probability for the label.

Fig. 9 is a table of users' preference probabilities for labels, as output in an exemplary embodiment of the disclosure. Referring to Fig. 9, the first column of the preference probability table is the user name, and the second and subsequent columns take the form label:preference score#label:preference score....
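As a non-limiting illustration, the step from the last level's output to a row of the table in Fig. 9 could be sketched in Python as follows. Averaging the vectors of the forests in the last level is the usual gcForest convention and is an assumption here; the tab separator, the descending sort, and both function names are likewise assumptions.

```python
def final_preferences(labels, last_level_vectors):
    """Average the last level's class vectors and pair each element with its label."""
    n = len(last_level_vectors)
    averaged = [sum(vec[i] for vec in last_level_vectors) / n
                for i in range(len(labels))]
    return dict(zip(labels, averaged))

def preference_row(user, preferences):
    """Format one output row as user, then label:score pairs joined by '#'."""
    ranked = sorted(preferences.items(), key=lambda kv: kv[1], reverse=True)
    return user + "\t" + "#".join(f"{label}:{score}" for label, score in ranked)

prefs = final_preferences(["digital", "food"], [[0.5, 0.5], [0.25, 0.75]])
row = preference_row("u1", prefs)
```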

Fig. 10 is a flowchart of a user preference analysis method in an exemplary embodiment of the disclosure. Referring to Fig. 10, user preference analysis method 1000 includes all steps of user preference analysis method 100 and can further include:

Step S1002: obtain the user's physical-goods category preference data.

Step S1004: correct the user's preference probability for the label according to the physical-goods category preference data.

Step S1006: select recommended content according to the preference probability.

Step S1008: obtain the user's click data on the recommended content, and correct the preference probability according to the click data.

, can be with when user preference analysis method 1000 is used to analyze user to the preference of electric business website story label It is general with category preference in kind being associated according to the user preference that gcForest Algorithm Analysis goes out with user to the preference of category in kind Rate corrects user preference probability to expand.

It is possible, firstly, to find the corresponding relation of commodity three-level category and label, and weight is done to the preference fraction of label and returned One change is handled：

(1) Using the user as the join key, obtain and join the user / third-level product category / preference table and the user / label / preference table; the join result is denoted TableA;

(2) In TableA, using the label ID and the third-level product category ID as the key, calculate the score of each third-level product category under each label, denoted score;

(3) In TableA, using the label ID as the key, calculate the total preference score under each label, denoted sumScore;

(4) Calculate the total preference score of all labels, denoted allScore;

(5) Using the third-level product category as the key, calculate the total score under each third-level product category, denoted sum;

(6) Calculate a filtering threshold for each label: the ratio of the score under this label to the total score of all labels. This ratio is the filtering threshold of the label, denoted tagRatio;

(7) Each third-level product category may correspond to multiple label IDs. A label ID is retained if the label's score under the third-level product category, divided by the total score of that third-level product category, is greater than the filtering threshold tagRatio;

(8) Normalize the ranking score: calculate a label weight score for the label IDs retained under each third-level product category, with the formula as follows:

Normalization diversifies the labels associated with a user under a third-level product category.
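Steps (1) to (7) above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the tuple layouts, the function name `retained_labels`, and the reading of "score" in step (2) as the sum of label preference scores over the joined rows are not specified by the text. The sketch stops at step (7) because the weight formula of step (8) is not given in this text.

```python
from collections import defaultdict


def retained_labels(user_category, user_label):
    """user_category: list of (user, category, score); user_label: list of (user, label, score).

    Returns ({category: {label: score}}, {label: tagRatio}) after the filter of step (7).
    """
    # (1) Join the two tables on user to form TableA.
    table_a = [
        (user, cat, label, label_score)
        for user, cat, _cat_score in user_category
        for other_user, label, label_score in user_label
        if user == other_user
    ]
    # (2) Score of each category under each label (assumed: sum of label scores).
    score = defaultdict(float)
    for _user, cat, label, label_score in table_a:
        score[(label, cat)] += label_score
    # (3) sumScore per label, (4) allScore, (5) sum per category.
    sum_score = defaultdict(float)
    cat_sum = defaultdict(float)
    for (label, cat), s in score.items():
        sum_score[label] += s
        cat_sum[cat] += s
    all_score = sum(sum_score.values())
    if all_score == 0:
        return {}, {}
    # (6) Filtering threshold per label.
    tag_ratio = {label: s / all_score for label, s in sum_score.items()}
    # (7) Keep a label under a category only if its share there exceeds tagRatio.
    kept = defaultdict(dict)
    for (label, cat), s in score.items():
        if cat_sum[cat] > 0 and s / cat_sum[cat] > tag_ratio[label]:
            kept[cat][label] = s
    return dict(kept), tag_ratio


# Hypothetical sample data: two users, two categories, two labels.
user_category = [("u1", "phones", 0.9), ("u2", "shoes", 0.8)]
user_label = [("u1", "digital", 0.7), ("u1", "sports", 0.3),
              ("u2", "sports", 0.6), ("u2", "digital", 0.4)]
kept, tag_ratio = retained_labels(user_category, user_label)
```

In this sample, "digital" survives under "phones" and "sports" survives under "shoes", because each label's share within that category exceeds its overall share.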

A user of an e-commerce website may have no preference for article labels but may have a preference for third-level product categories. In that case, the labels corresponding to the third-level product categories can be recommended to the user: using the third-level product category as the join key, join the normalized ranking scores with the user / third-level product category preference table; multiply the user's third-level product category preference score by the label weight score to obtain the expansion label score; and, using the user and the label as the composite key, calculate the user's score for each expansion label.

Correcting the user's label preference with the user's preference for third-level product categories yields a more accurate picture of the user's preference.
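The expansion label score described above can be sketched as follows. The data layouts and the function name are hypothetical, and summing contributions that share the same (user, label) key is an assumption; the text only states that the score is calculated per user and label.

```python
from collections import defaultdict


def expansion_label_scores(user_category_pref, label_weight):
    """user_category_pref: list of (user, category, preference score).
    label_weight: {category: {label: label weight score}}.

    Returns {(user, label): expansion label score}.
    """
    scores = defaultdict(float)
    for user, category, cat_score in user_category_pref:
        # Join on the third-level category, then multiply preference by label weight.
        for label, weight in label_weight.get(category, {}).items():
            scores[(user, label)] += cat_score * weight  # assumed aggregation: sum
    return dict(scores)


# Hypothetical inputs.
user_category_pref = [("u1", "phones", 0.8), ("u1", "laptops", 0.5)]
label_weight = {"phones": {"digital": 1.0},
                "laptops": {"digital": 0.6, "office": 0.4}}
expanded = expansion_label_scores(user_category_pref, label_weight)
```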

In an exemplary embodiment of the present disclosure, the method further includes:

Step S114, select recommended content according to the preference probability;

Step S116, obtain the user's click data on the recommended content, and correct the preference probability according to the click data.

After the user's label preferences are obtained, content under the several labels with the highest preference probability for each user can be recommended to that user. The criterion for selecting labels may be that the number of labels is less than or equal to a threshold, that the preference probability is greater than a threshold, or that the preference score is greater than a threshold. The present disclosure places no particular limitation on this.
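A minimal sketch of such a selection rule, combining a cap on the number of labels with a probability threshold, might look like this. The function name and the default values are illustrative assumptions.

```python
def select_labels(prefs, max_labels=3, min_prob=0.0):
    """Return up to max_labels labels whose preference probability exceeds min_prob,
    ordered from highest to lowest probability."""
    ranked = sorted(prefs.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, prob in ranked if prob > min_prob][:max_labels]


# Hypothetical preference probabilities for one user.
chosen = select_labels({"sports": 0.2, "digital": 0.5, "fashion": 0.3},
                       max_labels=2, min_prob=0.25)
```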

A user's clicks on recommended content can be recorded by marking. In some embodiments, recommended content that the user clicked is marked '1' and recommended content that the user did not click is marked '0'. In some embodiments, the number of clicks can also be recorded when the user clicks the recommended content multiple times.

By obtaining users' clicks on recommended content, the gcForest model described above can be trained to learn in a direction that better matches users' actual preferences, and the learned model is used to predict new data. This achieves the purpose of correcting the user preference probability by click-through rate.
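The click marking that feeds this retraining can be sketched as below. The function name and the input layout (a list of recommended content IDs and a raw click log) are assumptions for illustration.

```python
from collections import Counter


def click_marks(recommended, click_log):
    """Map each recommended content ID to (mark, click count).

    mark is 1 if the user clicked the content at least once, otherwise 0.
    """
    counts = Counter(click_log)
    return {item: (1 if counts[item] > 0 else 0, counts[item])
            for item in recommended}


# Hypothetical recommendation list and click log for one user.
marks = click_marks(["a1", "a2", "a3"], ["a1", "a3", "a3"])
```

The 0/1 marks could serve as training labels for the next round of model training, with the click counts kept as an optional extra signal.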

By combining with the online business, gathering statistics on whether the recommended content actually led users to click, and training the model on these statistics, the online operating model can gain an increase in PV (page views) and UV (unique visitors).

Traditional logistic regression relies on the business experience of data analysts to weigh feature coefficients and is limited to statistical analysis of a sampled portion of the data. By comparison, the gcForest algorithm can handle massive data, can reprocess complex features, and can discover interactions between features. The model is easy to train, is more interpretable than a deep neural network, can output more accurate results, and is better suited to complex business scenarios. The present disclosure analyzes user preference probability with the improved gcForest and corrects the user preference in combination with the specific business, so that user preferences are analyzed more accurately, the user experience is improved, and more revenue is brought to the online business.

Corresponding to the above method embodiments, the present disclosure also provides a big-data-based user preference analysis device, which can be used to perform the above method embodiments.

Referring to Fig. 11, the big-data-based user preference analysis device 1100 includes:

Data acquisition module 1102, configured to obtain interaction behavior data between a user and content, the content having at least one label.

Feature preprocessing module 1104, configured to preprocess the interaction behavior data to generate a feature data set, and to use the feature data set as the input feature values of the gcForest model.

Cascade forest module 1106, configured to use the class probability vector output by each level of the cascade forest in the gcForest model, together with the features of the feature data set, as the input features of the next level of the cascade forest.

Preference computing module 1108, configured to obtain the user's preference probability for the label according to the class probability vector output by the last level of the cascade forest of the gcForest model.
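The data flow handled by the cascade forest module and the preference computing module can be sketched as follows. Each "forest" here is a stand-in callable that returns a fixed class probability vector, not a trained model, and averaging the forests of the last level to obtain the final vector is an assumption of this sketch.

```python
from typing import Callable, List, Sequence

Forest = Callable[[Sequence[float]], List[float]]

seen_input_sizes: List[int] = []  # records the input length each forest receives


def recording_forest(probs: Sequence[float]) -> Forest:
    """Hypothetical stand-in for a trained forest: always returns `probs`."""
    def predict(x: Sequence[float]) -> List[float]:
        seen_input_sizes.append(len(x))
        return list(probs)
    return predict


def cascade_predict(features: Sequence[float],
                    levels: Sequence[Sequence[Forest]]) -> List[float]:
    """Run the cascade and return the class probability vector of the last level."""
    current = list(features)
    averaged: List[float] = []
    for forests in levels:
        outputs = [forest(current) for forest in forests]
        # Next-level input: this level's class vectors plus the original features.
        current = [p for out in outputs for p in out] + list(features)
        # Assumed aggregation: average the forests of the level.
        averaged = [sum(col) / len(outputs) for col in zip(*outputs)]
    return averaged


levels = [
    [recording_forest([0.7, 0.3]), recording_forest([0.5, 0.5])],
    [recording_forest([0.8, 0.2]), recording_forest([0.6, 0.4])],
]
result = cascade_predict([1.0, 2.0, 3.0], levels)
```

With three original features, two forests per level, and two classes, the second level receives 2 x 2 + 3 = 7 inputs, which is what `seen_input_sizes` records.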

In an exemplary embodiment of the present disclosure, the feature preprocessing module includes:

Missing value processing unit 11042, configured to determine whether missing data exist in the interaction behavior data and, if so, to supplement the missing data.

Outlier processing unit 11044, configured to delete the extreme maximum and minimum values in a preset range of the interaction behavior data.

Normalization unit 11046, configured to perform feature normalization on the interaction behavior data.

In an exemplary embodiment of the present disclosure, the feature preprocessing module further includes:

Feature adding unit 11048, configured to add a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
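One possible reading of this unit is sketched below: a new column holding the number of operations each user performed on the previous day is appended to the feature rows. The function name, the data layout, and the use of a simple count are assumptions; the text does not specify how the added feature value is derived.

```python
from datetime import date, timedelta


def add_previous_day_feature(features, events, today):
    """features: {user: [feature values]}; events: list of (user, date of operation).

    Returns a new mapping with the previous-day operation count appended to each row.
    """
    yesterday = today - timedelta(days=1)
    counts = {}
    for user, day in events:
        if day == yesterday:
            counts[user] = counts.get(user, 0) + 1
    return {user: row + [counts.get(user, 0)] for user, row in features.items()}


# Hypothetical feature rows and operation log.
features = {"u1": [0.2, 0.5], "u2": [0.1, 0.9]}
events = [("u1", date(2017, 9, 3)), ("u1", date(2017, 9, 3)), ("u2", date(2017, 9, 1))]
extended = add_previous_day_feature(features, events, today=date(2017, 9, 4))
```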

In an exemplary embodiment of the present disclosure, the device further includes:

Physical-product preference correction module 1110, configured to obtain the user's physical-product category preference data and to correct the user's preference probability for the label according to the physical-product category preference data.

In an exemplary embodiment of the present disclosure, the device further includes:

Click-through rate correction module 1112, configured to select recommended content according to the preference probability, to obtain the user's click data on the recommended content, and to correct the preference probability according to the click data.

Since each function of device 1100 has been described in detail in the corresponding method embodiments, it is not repeated here.

According to one aspect of the present disclosure, a big-data-based user preference analysis device is provided, including:

a memory; and

a processor coupled to the memory, the processor being configured to perform the method described in any one of the above based on instructions stored in the memory.

The specific manner in which the processor of the device in this embodiment performs operations has been described in detail in the embodiments of the big-data-based user preference analysis method and will not be elaborated here.

Fig. 12 is a block diagram of a device 1200 according to an exemplary embodiment. The device 1200 may be a mobile terminal such as a smartphone or a tablet computer.

Referring to Fig. 12, the device 1200 may include one or more of the following components: processing component 1202, memory 1204, power component 1206, multimedia component 1208, audio component 1210, sensor component 1214, and communication component 1216.

The processing component 1202 generally controls the overall operation of the device 1200, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 1202 may include one or more processors 1218 to execute instructions so as to complete all or some of the steps of the methods described above. In addition, the processing component 1202 may include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 may include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.

The memory 1204 is configured to store various types of data to support operation of the device 1200. Examples of such data include instructions for any application or method operated on the device 1200. The memory 1204 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. The memory 1204 also stores one or more modules configured to be executed by the one or more processors 1218 to complete all or some of the steps of any of the methods shown above.

The power component 1206 supplies power to the various components of the device 1200. The power component 1206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 1200.

The multimedia component 1208 includes a screen that provides an output interface between the device 1200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.

The audio component 1210 is configured to output and/or input audio signals. For example, the audio component 1210 includes a microphone (MIC), which is configured to receive external audio signals when the device 1200 is in an operating mode such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 1204 or sent via the communication component 1216. In some embodiments, the audio component 1210 also includes a speaker for outputting audio signals.

The sensor component 1214 includes one or more sensors for providing status assessments of various aspects of the device 1200. For example, the sensor component 1214 can detect the open/closed state of the device 1200 and the relative positioning of components. The sensor component 1214 can also detect a change in position of the device 1200 or of a component of the device 1200, and a change in temperature of the device 1200. In some embodiments, the sensor component 1214 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 1216 is configured to facilitate wired or wireless communication between the device 1200 and other equipment. The device 1200 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1216 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1216 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 1200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program is stored; when the program is executed by a processor, the big-data-based user preference analysis method described in any one of the above is implemented. The computer-readable storage medium may be, for example, a transitory or non-transitory computer-readable storage medium containing instructions.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the claims.

Claims

1. A big-data-based user preference analysis method, characterized by comprising:

obtaining interaction behavior data between a user and content, the content having at least one label;

preprocessing the interaction behavior data to generate a feature data set, and using the feature data set as the input feature values of a gcForest model;

using the class probability vector output by each level of the cascade forest in the gcForest model, together with the features of the feature data set, as the input features of the next level of the cascade forest;

obtaining the user's preference probability for the label according to the class probability vector output by the last level of the cascade forest of the gcForest model.
2. The user preference analysis method according to claim 1, characterized in that the interaction behavior data comprise data on the user's operations on the content within a preset time period, the data including the number of views, likes, shares, comments, detail views, and orders placed.
3. The user preference analysis method according to claim 1, characterized in that preprocessing the interaction behavior data comprises:

determining whether missing data exist in the interaction behavior data and, if so, supplementing the missing data;

deleting the extreme maximum and minimum values in a preset range of the interaction behavior data;

performing feature normalization on the interaction behavior data.
4. The user preference analysis method according to claim 1, characterized in that preprocessing the interaction behavior data further comprises:

adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
5. The user preference analysis method according to claim 1, characterized by further comprising:

obtaining the user's physical-product category preference data;

correcting the user's preference probability for the label according to the physical-product category preference data.
6. The user preference analysis method according to claim 1, characterized by further comprising:

selecting recommended content according to the preference probability;

obtaining the user's click data on the recommended content, and correcting the preference probability according to the click data.
7. A big-data-based user preference analysis device, characterized by comprising:

a data acquisition module, configured to obtain interaction behavior data between a user and content, the content having at least one label;

a feature preprocessing module, configured to preprocess the interaction behavior data to generate a feature data set, and to use the feature data set as the input feature values of a gcForest model;

a cascade forest module, configured to use the class probability vector output by each level of the cascade forest in the gcForest model, together with the features of the feature data set, as the input features of the next level of the cascade forest;

a preference computing module, configured to obtain the user's preference probability for the label according to the class probability vector output by the last level of the cascade forest of the gcForest model.
8. The user preference analysis device according to claim 7, characterized in that the interaction behavior data comprise data on the user's operations on the content within a preset time period, the data including the number of views, likes, shares, comments, detail views, and orders placed.
9. The user preference analysis device according to claim 7, characterized in that the feature preprocessing module comprises:

a missing value processing unit, configured to determine whether missing data exist in the interaction behavior data and, if so, to supplement the missing data;

an outlier processing unit, configured to delete the extreme maximum and minimum values in a preset range of the interaction behavior data;

a normalization unit, configured to perform feature normalization on the interaction behavior data.
10. The user preference analysis device according to claim 7, characterized in that the feature preprocessing module further comprises:

a feature adding unit, configured to add a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
11. The user preference analysis device according to claim 7, characterized by further comprising:

a physical-product preference correction module, configured to obtain the user's physical-product category preference data and to correct the user's preference probability for the label according to the physical-product category preference data.
12. The user preference analysis device according to claim 7, characterized by further comprising:

a click-through rate correction module, configured to select recommended content according to the preference probability, to obtain the user's click data on the recommended content, and to correct the preference probability according to the click data.
13. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, the method steps of any one of claims 1 to 6 are implemented.