CN107590224B

CN107590224B - Big data based user preference analysis method and device

Info

Publication number: CN107590224B
Application number: CN201710786530.7A
Authority: CN
Inventors: 王颖帅; 李晓霞; 苗诗雨
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-09-04
Filing date: 2017-09-04
Publication date: 2021-11-30
Anticipated expiration: 2037-09-04
Also published as: CN107590224A

Abstract

The disclosure provides a big data-based user preference analysis method and device. The method comprises the following steps: acquiring interactive behavior data of a user and content, wherein the content is provided with at least one label; preprocessing the interaction behavior data to generate a characteristic data set, and taking the characteristic data set as an input characteristic value of a gcForest model; taking the class probability vector output by each level of cascade forest in the gcForest model and the characteristics of the characteristic data set as the input characteristics of the next level of cascade forest; and acquiring the preference probability of the user to the label according to the class probability vector output by the last layer of cascade forest of the gcForest model. The user preference analysis method provided by the disclosure can provide more accurate user preference analysis results based on big data samples.

Description

Big data based user preference analysis method and device

Technical Field

The disclosure relates to the technical field of machine learning, in particular to a big data-based user preference analysis method and device.

Background

With the development of internet technology, personalized content recommendation for users has become increasingly common. Taking article recommendation as an example, one or more labels are set for each article according to its content, and the user's operations on articles are collected. From these operations, the labels the user prefers can be analyzed, so that other articles under those labels can be recommended to the user and the user experience is improved.

In existing personalized recommendation technology, user preference is mainly analyzed either by a method based on the logistic regression (LR) algorithm or by a statistical-formula scoring method in which, following an analyst's strategy, each feature is weighted by time. In the LR-based analysis, a data analyst needs to decide, based on business experience, which features to extract and how the content is to be tagged. After the features and label data are obtained, stratified sampling is performed over the different labels, and the coefficient of each feature is obtained using the logistic regression model of a statistical analysis software package, which determines the user label preference score formula. The time-weighted statistical scoring method assumes that the user prefers recently selected content to content selected earlier, and maintains the data according to time weights: a suitable function is found to determine the time weight of each of the 365 days of the year, and the features are finally combined into a statistical formula with a time dimension.

In the above technology, the LR-based analysis method requires an analyst to determine the coefficient of each feature from experience. It therefore depends strongly on the analyst's experience, each service requires manual analysis, efficiency is low, and only a small number of samples can be handled. As for the time-weighted statistical scoring method, because users' preference for content differs across time periods, it is difficult to find the most suitable time weight function, and therefore difficult to mine user preference accurately.

Therefore, a user preference analysis method capable of processing a large number of samples and providing more accurate analysis results is of great significance for improving personalized recommendation capability and content click rate.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

An object of the present disclosure is to provide a big data based user preference analysis method and apparatus for overcoming, at least to some extent, one or more of the problems due to the limitations and disadvantages of the related art.

According to a first aspect of the embodiments of the present disclosure, there is provided a big data-based user preference analysis method, including: acquiring interactive behavior data of a user and content, wherein the content is provided with at least one label; preprocessing the interaction behavior data to generate a characteristic data set, and taking the characteristic data set as an input characteristic value of a gcForest model; taking the class probability vector output by each level of cascade forest in the gcForest model and the characteristics of the characteristic data set as the input characteristics of the next level of cascade forest; and acquiring the preference probability of the user to the label according to the class probability vector output by the last layer of cascade forest of the gcForest model.

In an exemplary embodiment of the disclosure, the interactive behavior data includes data on the user's operations on the content within a preset time period, the data including the number of views, the number of likes, the number of shares, the number of comments, the number of detail views, and the number of orders.

In an exemplary embodiment of the present disclosure, preprocessing the interaction behavior data includes: judging whether the interactive behavior data has missing data and, if so, supplementing the missing data; deleting the largest and smallest values within a preset range in the interactive behavior data; and performing feature normalization processing on the interactive behavior data.

In an exemplary embodiment of the present disclosure, preprocessing the interaction behavior data further includes: adding a column of feature values according to the interactive behavior data and the user's operations on the content on the day before the current time.

In an exemplary embodiment of the present disclosure, the method further comprises: acquiring physical category preference data of a user; and correcting the preference probability of the user for the label according to the physical category preference data.

In an exemplary embodiment of the present disclosure, further comprising: selecting recommended content according to the preference probability; and acquiring click data of the user on the recommended content, and correcting the preference probability according to the click data.

According to a second aspect of the present disclosure, there is provided a big data-based user preference analysis apparatus, including: the data acquisition module is used for acquiring interactive behavior data of a user and content, and the content is provided with at least one label; the characteristic preprocessing module is used for preprocessing the interactive behavior data and generating a characteristic data set, and the characteristic data set is used as an input characteristic value of the gcForest model; the cascade forest module is used for taking the class probability vector output by each level of cascade forest in the gcForest model and the characteristics of the characteristic data set as the input characteristics of the next level of cascade forest; and the preference calculation module is used for acquiring the preference probability of the user on the label according to the class probability vector output by the last layer of cascade forest of the gcForest model.

In an exemplary embodiment of the present disclosure, the feature preprocessing module includes: the missing value processing unit is used for judging whether the interactive behavior data has missing data or not, and if so, supplementing the missing data; the abnormal value processing unit is used for deleting the maximum value and the minimum value of a preset range in the interactive behavior data; and the normalization processing unit is used for carrying out characteristic normalization processing on the interactive behavior data.

In an exemplary embodiment of the present disclosure, the feature preprocessing module further includes: a feature adding unit, used for adding a column of feature values according to the interactive behavior data and the user's operations on the content on the day before the current time.

In an exemplary embodiment of the present disclosure, the apparatus further comprises: a physical category preference correction module, used for acquiring physical category preference data of a user and correcting the preference probability of the user for the label according to the physical category preference data.

In an exemplary embodiment of the present disclosure, further comprising: and the click rate correction module is used for selecting recommended contents according to the preference probability, acquiring click data of the recommended contents from the user and correcting the preference probability according to the click data.

According to a third aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of the above.

According to the method, the improved multi-granularity cascade forest algorithm gcForest is used for carrying out distributed processing on the big data sample, the preference of the user to the content label is analyzed according to the output result, the more accurate user preference analysis result can be obtained under the condition of using richer data, the personalized recommendation efficiency is improved, and the user experience is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.

Fig. 1 schematically illustrates a flow chart of a big data based user preference analysis method in an exemplary embodiment of the present disclosure.

FIG. 2 is a schematic illustration of interaction behavior data in an exemplary embodiment of the disclosure.

FIG. 3 is a flow chart of pre-processing interaction behavior data in an exemplary embodiment of the disclosure.

FIG. 4 is a flow chart for handling missing values in interaction behavior data in an exemplary embodiment of the disclosure.

FIG. 5 is a schematic representation of the data after preprocessing the interaction behavior data in an exemplary embodiment of the disclosure.

FIG. 6 is a schematic diagram of a multi-grained cascade forest (gcForest) structure.

FIG. 7 is a schematic diagram of class probability vector generation in cascading forests.

Fig. 8 is a schematic diagram of an improvement to the gcForest algorithm in an exemplary embodiment of the present disclosure.

Fig. 9 is a table of user preference probability data for tags output in an exemplary embodiment of the present disclosure.

Fig. 10 is a flowchart of a user preference analysis method in an exemplary embodiment of the present disclosure.

Fig. 11 schematically illustrates a block diagram of a big data based user preference analysis apparatus in an exemplary embodiment of the present disclosure.

Fig. 12 schematically illustrates a block diagram of another big data based user preference analysis apparatus according to an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

Further, the drawings are merely schematic illustrations of the present disclosure, in which the same reference numerals denote the same or similar parts, and thus, a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The following detailed description of exemplary embodiments of the disclosure refers to the accompanying drawings.

Fig. 1 schematically illustrates a flow chart of a big data based user preference analysis method in an exemplary embodiment of the present disclosure. Referring to fig. 1, a big data based user preference analysis method 100 includes:

Step S102, acquiring interactive behavior data of a user and content, wherein the content has at least one label.

Step S104, preprocessing the interactive behavior data to generate a characteristic data set, and taking the characteristic data set as an input characteristic value of the gcForest model.

Step S106, taking the class probability vector output by each level of cascade forest in the gcForest model and the characteristics of the characteristic data set as the input characteristics of the next level of cascade forest.

Step S108, acquiring the preference probability of the user to the label according to the class probability vector output by the last layer of cascade forest of the gcForest model.

By using the improved multi-granularity cascade forest algorithm gcForest to perform distributed processing on the massive samples and analyzing the preference of the user on the content label according to the output result, a more accurate user preference analysis result can be obtained under the condition of using richer data, the personalized recommendation efficiency is improved, and the user experience is improved.

The steps of the method 100 are described in detail below.

In step S102, data of interaction behavior between a user and content is obtained, wherein the content has at least one tag.

"content" as referred to in this disclosure includes, but is not limited to, articles, merchandise, music, videos, books, or other content that may be recommended to a user. For convenience of description, the disclosure only takes article recommendation as an example, and those skilled in the art can set the method to be applied to personalized recommendation of other content by themselves.

FIG. 2 is a schematic illustration of interaction behavior data in an exemplary embodiment of the disclosure. Referring to fig. 2, the interactive behavior data may include data on a user's operations on the content within a preset time period, the data including the number of views, the number of likes, the number of shares, the number of comments, the number of detail views, and the number of orders.

Specifically, in the article "Discover" channel of Jingdong (JD), about 13 million users were found to have had direct behaviors on article content within 90 days. This number of users is sufficient for data analysis, so the preset time period can be set to 90 days, and the interactive behavior data can be the user behaviors within 90 days. In some embodiments, however, the user's operation data within 30 days may be used when analyzing features such as the user's like features and sharing features.

The following six features can be extracted by HIVE for the user behavior recorded in the database:

Feature 1: the user's browsing score feature for the tag over the past 90 days;

Feature 2: the user's like score feature for the tag over the past 90 days;

Feature 3: the user's sharing score feature for the tag over the past 30 days;

Feature 4: the user's comment score feature for the tag over the past 30 days;

Feature 5: the score feature of the user's clicks on product details on the tag page over the past 30 days;

Feature 6: the user's order score feature attributable to the tag.

HIVE is a data warehouse tool based on Hadoop. It can map structured data files to database tables and provides a simple SQL query function, which makes it well suited to statistical analysis of stored data. The extracted score features include the number of user behaviors within the preset time period. When the score features are processed, feature values can be set during data preprocessing according to the specific business meaning of each feature dimension. In some embodiments, the feature value may also be set according to a specific weight for each user behavior; the feature value may be, for example, the number of user behaviors or a weighted value of that number. The method of setting the feature value may be chosen by those skilled in the art according to practical circumstances, and the present disclosure is not particularly limited in this respect.

In the method, the importance of each of a large number of candidate features with respect to the target prediction variable, the user's click conversion rate, is calculated, and the six most valuable features are selected according to information gain. These six features cover the analysis of more than 90% of user behavior information.
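The selection by information gain can be illustrated with a minimal sketch. It assumes discrete (or already binned) feature values and a binary click target; the function names `entropy` and `information_gain` are illustrative and do not come from the disclosure, which does not specify how the gain is computed.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature: Sequence[Hashable], target: Sequence[Hashable]) -> float:
    """Entropy of the target minus its entropy conditioned on the feature value."""
    total = len(target)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, target):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - conditional


# Hypothetical toy data: 1 = clicked, 0 = not clicked.
clicked = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # perfectly separates the target
uninformative = [1, 0, 1, 0]  # carries no information about the target
gains = {
    "informative": information_gain(informative, clicked),
    "uninformative": information_gain(uninformative, clicked),
}
```

Ranking candidate features by such a gain and keeping the top six would correspond to the selection described above; continuous features would first need to be discretized, which this sketch does not cover.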

As shown in fig. 2, in the extracted feature data format the first column is the user name, the second column is the content tag, and the third to eighth columns are the feature values of the six features extracted by HIVE.

In step S104, the interaction behavior data is preprocessed to generate a feature data set, and the feature data set is used as an input feature value of the gcForest model.

FIG. 3 is a flow chart of pre-processing interaction behavior data in an exemplary embodiment of the disclosure. Referring to fig. 3, the process of preprocessing the interactive behavior data may include:

Step S302, judging whether the interactive behavior data has missing data, and if so, supplementing the missing data.

Step S304, deleting the largest and smallest values within a preset range in the interactive behavior data.

Step S306, performing characteristic normalization processing on the interactive behavior data.

Referring to fig. 4, after the data is acquired in step S302, step S3020 is first performed to determine whether a missing value exists. If not, the flow proceeds to step S304; if there is a missing value, the flow proceeds to step S3021 to determine whether the missing value is meaningful, that is, whether the feature data matters for predicting the key behavior of the target. For example, if only one of a user's six feature values is missing, the user can be judged to be an active user and the missing feature does not affect the prediction of the user's behavior; in this case the missing value is determined to be meaningless. In some embodiments, whether the missing value is meaningful may be determined by calculating the ratio of missing values to existing values: when the ratio is less than a threshold, the missing value may be determined to be meaningless, and when the ratio is greater than or equal to the threshold, the missing value may be considered meaningful.

When the missing value is meaningful and of categorical type, the flow proceeds to step S3022 to create a new category for the missing value; when the missing value is meaningful and numerical, the flow proceeds to step S3023 to set the missing value to a reasonable value, for example the mean or median of all feature values under that feature. When the missing value is meaningless, the flow proceeds to step S3024 to determine the specific missing-data situation. If only a small portion of the overall data is missing (less than a threshold), the flow proceeds to step S3025 to delete the instances with missing data; if the overall data is time-ordered, the flow proceeds to step S3026 to replace the missing value with a data point earlier in time (e.g., the data point whose time is closest to the earliest time of the overall data); otherwise, the flow proceeds to step S3027 to determine whether the overall data obeys a simple distribution. If it does not, the flow proceeds to step S3028 to generate a substitute for the missing value using a simple machine learning model and then proceeds to step S304; if the overall data obeys a simple distribution and has no abnormal values, the flow proceeds to step S3029 to replace the missing value with the mean of that column of data and then proceeds to step S304; if the overall data obeys a simple distribution and has abnormal values, the flow proceeds to step S3030 to replace the missing value with the median of that column of data and then proceeds to step S304. An abnormal value (outlier) is a value in a set that deviates from the mean by more than twice the standard deviation.
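The branching of fig. 4 can be summarized in a short sketch that returns the step identifier to apply. The function and parameter names are hypothetical, the inputs (whether the value is meaningful, whether the data is time-ordered or follows a simple distribution) are assumed to have been decided beforehand, and the use of the population standard deviation is an assumption since the text does not say which form is meant.

```python
import statistics
from typing import Sequence


def has_outliers(values: Sequence[float]) -> bool:
    """True if any value deviates from the mean by more than twice the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return any(abs(v - mean) > 2 * std for v in values)


def choose_missing_value_step(
    meaningful: bool,
    categorical: bool,
    missing_ratio: float,
    small_missing_threshold: float,
    time_ordered: bool,
    simple_distribution: bool,
    observed_values: Sequence[float],
) -> str:
    """Return the step of fig. 4 that handles the missing value."""
    if meaningful:
        # S3022: create a new category; S3023: fill with a reasonable value.
        return "S3022" if categorical else "S3023"
    if missing_ratio < small_missing_threshold:
        return "S3025"  # delete the instances with missing data
    if time_ordered:
        return "S3026"  # replace with an earlier data point
    if not simple_distribution:
        return "S3028"  # substitute generated by a simple model
    # S3030: fill with the median; S3029: fill with the mean.
    return "S3030" if has_outliers(observed_values) else "S3029"
```

For example, a meaningless missing value in a column that is not time-ordered, follows a simple distribution and contains an outlier would be routed to S3030 (median fill).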

In step S304, by analyzing the data distribution of each feature, the top x% and the bottom y% of the data are deleted, which removes abnormal values from the data. In the exemplary embodiment of the disclosure, x and y may both be 5; that is, abnormal values may be removed by deleting the largest 5% and the smallest 5% of the data.
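A minimal sketch of this trimming step, assuming the percentages are applied to the sorted values of a single feature and that fractional counts are rounded down (the disclosure does not state a rounding rule); `trim_extremes` is a hypothetical name.

```python
from typing import List, Sequence


def trim_extremes(
    values: Sequence[float], top_pct: float = 5.0, bottom_pct: float = 5.0
) -> List[float]:
    """Drop the largest top_pct% and smallest bottom_pct% of values; returns sorted survivors."""
    ordered = sorted(values)
    n = len(ordered)
    drop_low = int(n * bottom_pct / 100)   # assumption: round down
    drop_high = int(n * top_pct / 100)     # assumption: round down
    return ordered[drop_low:n - drop_high]


# With 100 values and x = y = 5, the five smallest and five largest are removed.
trimmed = trim_extremes(list(range(1, 101)))
```

In a real data table the same cut-offs would be used to drop whole rows rather than to reorder a single column; the sketch only shows how the cut-offs are derived.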

In step S306, normalization processing may be performed on the data of each feature. The formula for data normalization may be, for example:

$$y_i^{new} = \frac{y_i - y_{min}}{y_{max} - y_{min}}$$

where $y_i^{new}$ is the normalized feature value, $y_i$ is the original data, $y_{min}$ is the minimum of all existing data for that feature, and $y_{max}$ is the maximum of all existing data for that feature. After normalization, the feature data are distributed between 0 and 1.
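The normalization described here, which maps each feature into the range 0 to 1 using the feature's minimum and maximum, can be sketched as follows. The handling of a constant column is an assumption, since the text does not address the case where the maximum equals the minimum.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values into [0, 1] using the column minimum and maximum."""
    y_min, y_max = min(values), max(values)
    if y_max == y_min:
        # Assumption: a constant column is mapped to 0 to avoid division by zero.
        return [0.0 for _ in values]
    span = y_max - y_min
    return [(y - y_min) / span for y in values]


normalized = min_max_normalize([2, 4, 6])
```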

In an exemplary embodiment of the present disclosure, preprocessing the interaction behavior data may further include: step S308, adding a column of feature values according to the interactive behavior data and the user's operations on the content on the day before the current time.

In addition to the user's behavior within the preset time period, it can be observed whether, yesterday, the user clicked content under a tag on which the user had previously operated. If so, a feature value of "1" is added to the interactive behavior data of that user and tag; if not, a feature value of "0" is added. The added feature values form a new feature column.
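A minimal sketch of adding this 0/1 column, assuming each record is a dictionary keyed by hypothetical field names `user` and `tag`, and that yesterday's clicks are available as a set of (user, tag) pairs:

```python
from typing import Dict, List, Set, Tuple


def add_yesterday_flag(
    rows: List[Dict[str, object]], clicked_yesterday: Set[Tuple[str, str]]
) -> List[Dict[str, object]]:
    """Return copies of the rows with a 0/1 'clicked_yesterday' feature added."""
    return [
        dict(row, clicked_yesterday=1 if (row["user"], row["tag"]) in clicked_yesterday else 0)
        for row in rows
    ]


# Hypothetical records for one user and two tags.
rows = [{"user": "u1", "tag": "t1"}, {"user": "u1", "tag": "t2"}]
flagged = add_yesterday_flag(rows, {("u1", "t1")})
```

The input rows are left unmodified, so the new column can be recomputed each day from the same 90-day features.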

After preprocessing the data, the complete training data set may be constructed on the big data platform Spark. FIG. 5 is a schematic representation of the data after preprocessing the interaction behavior data in an exemplary embodiment of the disclosure. Referring to fig. 5, the first column of the data table is the added feature value, and the following columns are, in order, entries of the form "feature number: feature value".
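One way to picture such a row is a sparse text line with the added 0/1 value first, followed by numbered feature entries. The sketch below assumes space-separated fields and 1-based feature numbers; fig. 5 is not reproduced here, so the exact separators are an assumption, and `to_training_line` is a hypothetical helper.

```python
from typing import Sequence


def to_training_line(flag: int, features: Sequence[float]) -> str:
    """Format one instance as '<flag> 1:<v1> 2:<v2> ...' (assumed layout)."""
    parts = [str(flag)] + [f"{i}:{v}" for i, v in enumerate(features, start=1)]
    return " ".join(parts)


line = to_training_line(1, [0.5, 0.25])
```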

By preprocessing the acquired interaction behavior, a more effective and more accurate data source can be provided for the following analysis process.

Next, the preprocessed data is used as input data of the machine learning model.

It is worth noting that, before data is tested, the machine learning model needs to be trained using a data set. The trained model may be used to test a data set that includes the training data set, and in some embodiments of the present disclosure, the data set used for testing may include preprocessed real-time user behavior data obtained from an online data flow.

In an exemplary embodiment of the present disclosure, the gcForest algorithm is chosen as the machine learning algorithm that analyzes user preferences. The gcForest algorithm is a multi-grained cascade decision-tree ensemble method. Like feature learning in a deep neural network, it relies mainly on layer-by-layer processing of the original features, using a cascade structure in which multiple forests of decision trees perform feature learning. The multi-grained scanning of the input in the gcForest algorithm enhances the feature learning capability of the cascade forest. Compared with the traditional logistic regression algorithm, it extracts features more effectively, is better suited to accurate personalized recommendation on big data, is better suited to parallel deployment, and has the advantages of simple theoretical analysis and few parameters to tune.

FIG. 6 is a schematic diagram of a multi-grained cascade forest (gcForest) structure. Referring to fig. 6, each level in the cascade forest receives the feature information processed by the previous level and outputs its own processing result to the next level. Each cascade level contains two random forests and two completely random forests; each completely random forest contains 1000 completely random trees, and each random forest contains 1000 random trees. The gcForest algorithm uses the cascade structure to divide model training into two stages, a feature generation stage and a result output stage. In the feature generation stage, each completely random tree in a completely random forest randomly selects one feature to split on at each node, and the tree grows until each leaf node contains only instances of the same class or no more than 10 instances. In contrast, each random tree in a random forest randomly selects a number of features equal to the square root of the total number of features as candidates and splits on the candidate with the best Gini value. Assuming that n classes are to be predicted, each forest outputs an n-dimensional class probability vector, and these vectors are then concatenated into interactive combination features that serve as input data for the next-level forest.
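Two details of a cascade level can be illustrated with a small sketch: the number of candidate features a random tree considers per split, and the concatenation of the four forests' class probability vectors. It assumes the candidate count is the square root of the feature count rounded down, as in the published gcForest description; the function names and the numeric vectors are illustrative only.

```python
import math
from typing import List, Sequence


def candidate_feature_count(n_features: int) -> int:
    """Candidate features per split in a random tree: floor(sqrt(d)), at least 1."""
    return max(1, math.isqrt(n_features))


def concat_class_vectors(forest_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the n-dimensional class vectors of all forests in one level."""
    return [p for vector in forest_outputs for p in vector]


# Hypothetical outputs of one level for n = 3 classes.
level_output = concat_class_vectors([
    [0.7, 0.2, 0.1],  # random forest 1
    [0.6, 0.3, 0.1],  # random forest 2
    [0.5, 0.3, 0.2],  # completely random forest 1
    [0.6, 0.2, 0.2],  # completely random forest 2
])
```

With four forests and n classes, each level therefore contributes 4n values to the input of the next level.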

FIG. 7 is a schematic diagram of class probability vector generation in cascading forests. Referring to FIG. 7, different labels in leaf nodes represent different classes. When a new user instance enters the gcForest model, each forest will calculate the percentage of samples of the different classes at the leaf nodes where the relevant instance falls, and calculate the mean for all trees in the forest to generate an estimate of the distribution of the classes, i.e., each forest will output a class probability vector. To reduce the risk of overfitting, the class probability vectors generated by each forest are generated by K-fold cross validation.
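The averaging described above can be sketched without any tree implementation by assuming that, for one instance, the training labels found in the leaf it reaches are already known for each tree. The names `leaf_distribution` and `forest_class_vector` are hypothetical, and the K-fold cross validation step is not shown.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def leaf_distribution(leaf_labels: Sequence[Hashable], classes: Sequence[Hashable]) -> List[float]:
    """Fraction of training samples of each class in the leaf an instance falls into."""
    counts = Counter(leaf_labels)
    total = len(leaf_labels)
    return [counts[c] / total for c in classes]


def forest_class_vector(
    per_tree_leaf_labels: Sequence[Sequence[Hashable]], classes: Sequence[Hashable]
) -> List[float]:
    """Average the per-tree leaf distributions to get the forest's class probability vector."""
    distributions = [leaf_distribution(labels, classes) for labels in per_tree_leaf_labels]
    n_trees = len(distributions)
    return [sum(d[i] for d in distributions) / n_trees for i in range(len(classes))]


# Two trees: the instance lands in a mixed leaf in the first and a pure leaf in the second.
vector = forest_class_vector([["a", "a", "b", "b"], ["a", "a", "a", "a"]], classes=["a", "b"])
```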

In step S106, the class probability vector output by each level of cascade forest in the gcForest model and the features of the feature data set are used as the input features of the next level of cascade forest.

Fig. 8 is a schematic diagram of an improvement to the gcForest algorithm in an exemplary embodiment of the present disclosure. Referring to fig. 8, instances extracted with sliding feature windows of the same size are used to train a completely random forest and a random forest; the trained forests generate class probability vectors, which are concatenated into transformed features. The existing gcForest algorithm constructs the new input features of the next-level cascade forest by interactively combining the output of the previous-level cascade forest with the original features. The present method improves how these new input features are constructed: the input features of the first-level cascade forest are the 7 preprocessed features, while the input features of every other level are the original 7 features, the interactive combination features output by the previous-level cascade forest, and the leaf-node preference probability features output by the previous-level cascade forest. That is, the preference probability predicted by the previous-level forest is used as a new input feature of the next-level forest. Specifically, the preference probability predicted by the previous-level forest is taken as a new feature and is captured by the sliding window together with the original 7 features and the interactive combination features output by the previous-level forest, and the preference probability is added to the classification features of the next-level forest. By adding this feature at each level of the cascade forest, the accuracy of the gcForest algorithm can be improved.
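The construction of the next level's input can be sketched as a simple concatenation. The sketch assumes that the "preference probability predicted by the previous level" is the element-wise mean of that level's forest vectors; the disclosure does not state how it is derived, so `mean_vector` is an illustrative stand-in, and sliding-window scanning is not shown.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors (assumed level prediction)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def next_level_input(
    original_features: Sequence[float], forest_vectors: Sequence[Sequence[float]]
) -> List[float]:
    """Original features + concatenated forest vectors + previous-level preference probability."""
    combined = [p for vector in forest_vectors for p in vector]
    preference = mean_vector(forest_vectors)
    return list(original_features) + combined + preference


# Hypothetical values: 7 preprocessed features, 4 forests, 3 tags.
original = [0.1, 0.4, 0.0, 0.3, 0.9, 0.2, 1.0]
forests = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]
features_for_level_2 = next_level_input(original, forests)
```

Under these assumptions the second level receives 7 + 4 × 3 + 3 = 22 input values, compared with 7 for the first level.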

Referring to fig. 7, the class probability vector output by the last layer of the gcForest model has the form {a, b, c, d, ……}. The number of elements in the vector equals the number of tags involved, the elements sum to 1, and each element is the preference probability of one user for one tag. By obtaining the class probability vectors for the instances of multiple users, the preference scores of those users for the tags can be obtained. The calculation of the preference score can be set by persons skilled in the art according to actual conditions, as long as it is based on the user's preference probability for the tag.
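Reading the final vector as per-tag preference probabilities can be sketched as below. The tag names are hypothetical, and the checks simply enforce the two properties stated above: one element per tag, and elements summing to 1.

```python
from typing import Dict, Sequence


def tag_preferences(
    tags: Sequence[str], class_vector: Sequence[float], tol: float = 1e-6
) -> Dict[str, float]:
    """Map each tag to the user's preference probability from the last-level vector."""
    if len(tags) != len(class_vector):
        raise ValueError("vector length must equal the number of tags")
    if abs(sum(class_vector) - 1.0) > tol:
        raise ValueError("class probabilities must sum to 1")
    return dict(zip(tags, class_vector))


# Hypothetical tags and final-level output for one user.
prefs = tag_preferences(["digital", "fashion", "home"], [0.5, 0.25, 0.25])
favorite = max(prefs, key=prefs.get)
```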

Fig. 9 is a table of user preference probability data for tags output in an exemplary embodiment of the present disclosure. Referring to fig. 9, the first column of the preference probability data table is the user name, and the second and subsequent columns take the form "tag: preference score # tag: preference score ……".
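A row of such a table could be produced as in the following sketch. It assumes a tab between the user name and the score list and no spaces around the ":" and "#" separators; fig. 9 is not reproduced here, so these layout details and the helper name `format_preference_row` are assumptions.

```python
from typing import Dict


def format_preference_row(user: str, prefs: Dict[str, float]) -> str:
    """Render '<user>\\t<tag>:<score>#<tag>:<score>...' for one user (assumed layout)."""
    body = "#".join(f"{tag}:{score}" for tag, score in prefs.items())
    return f"{user}\t{body}"


row = format_preference_row("u1", {"digital": 0.5, "fashion": 0.25})
```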

Fig. 10 is a flowchart of a user preference analysis method in an exemplary embodiment of the present disclosure. Referring to fig. 10, the user preference analysis method 1000 may include, in addition to all the steps of the user preference analysis method 100:

Step S1002, acquiring the physical category preference data of the user.

Step S1004, correcting the preference probability of the user for the label according to the physical category preference data.

Step S1006, selecting recommended content according to the preference probability.

Step S1008, obtaining click data of the user on the recommended content, and correcting the preference probability according to the click data.

When the user preference analysis method 1000 is used to analyze a user's degree of preference for the article labels of an e-commerce website, the user preference analyzed by the gcForest algorithm can be associated with the user's preference for physical product categories (the object class preference data), and the category preference probability is used to expand and correct the user preference probability.

Firstly, the correspondence between third-level commodity categories and labels can be found, and the preference scores of the labels are weight-normalized:

(1) taking the user as the association key, acquiring and joining the user / third-level commodity category / preference data table and the user / label / preference data table, and recording the join result as TableA;

(2) in TableA, taking the label number and the third-level commodity category number as the association key, calculating the score of each third-level commodity category under each label, recorded as score;

(3) in TableA, taking the label number as the association key, calculating the total preference score under each label, recorded as sumScore;

(4) calculating the total preference score of all labels, recorded as allScore;

(5) taking the third-level commodity category as the association key, calculating the total score of each third-level commodity category, recorded as sum;

(6) calculating a filtering threshold for each label: the proportion of the score under the label to the total score of all labels is the filtering threshold of that label, recorded as tagRatio;

(7) each third-level commodity category corresponds to a plurality of label numbers, and a label number is retained if the label's score under the third-level commodity category divided by the total score of the third-level commodity category is greater than the filtering threshold tagRatio;

(8) normalized ranking score: calculating the label weight score of the label numbers retained by each third-level commodity category according to the following formula:

The normalization process can diversify the labels corresponding to the user under each third-level commodity category.
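Steps (1) to (8) can be sketched in plain Python as below. Several points are assumptions made only so that the sketch runs: the score in step (2) is taken to be the sum of the users' label preference scores over the joined rows, and, because the formula for step (8) is not reproduced in the text, the weight of a retained label is assumed to be its score divided by the sum of the retained scores in the same category. The function and variable names are hypothetical.

```python
from collections import defaultdict


def label_weights(user_category_prefs, user_tag_prefs):
    """user_category_prefs: (user, category, score); user_tag_prefs: (user, tag, score)."""
    # (1) Join the two tables on user to obtain TableA.
    tags_by_user = defaultdict(list)
    for user, tag, tag_score in user_tag_prefs:
        tags_by_user[user].append((tag, tag_score))
    table_a = [
        (category, tag, tag_score)
        for user, category, _category_score in user_category_prefs
        for tag, tag_score in tags_by_user[user]
    ]

    score = defaultdict(float)         # (2) per (tag, category); assumed to be a sum
    sum_score = defaultdict(float)     # (3) per tag
    category_sum = defaultdict(float)  # (5) per category
    for category, tag, tag_score in table_a:
        score[(tag, category)] += tag_score
        sum_score[tag] += tag_score
        category_sum[category] += tag_score

    all_score = sum(sum_score.values())  # (4)
    if all_score == 0:
        return {}
    tag_ratio = {tag: s / all_score for tag, s in sum_score.items()}  # (6)

    retained = defaultdict(dict)  # (7) keep tags whose share exceeds tagRatio
    for (tag, category), s in score.items():
        total = category_sum[category]
        if total > 0 and s / total > tag_ratio[tag]:
            retained[category][tag] = s

    weights = {}  # (8) ASSUMED normalization: share of the retained scores
    for category, tag_scores in retained.items():
        kept_total = sum(tag_scores.values())
        weights[category] = {t: s / kept_total for t, s in tag_scores.items()}
    return weights


# Hypothetical data: two users, two categories, two tags.
category_prefs = [("u1", "phones", 1.0), ("u2", "books", 1.0)]
tag_prefs = [("u1", "tech", 0.8), ("u1", "reading", 0.2),
             ("u2", "tech", 0.1), ("u2", "reading", 0.9)]
weights = label_weights(category_prefs, tag_prefs)
```

In this toy data the "tech" label is retained only under "phones" and the "reading" label only under "books", since those are the only pairs whose share within the category exceeds the label's overall share.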

A user of the e-commerce website may have no preference for article labels but have a preference for a third-level commodity category. In that case, the labels corresponding to that third-level commodity category can be recommended to the user: the normalized ranking scores are joined with the user's third-level commodity category preference table, using the third-level commodity category as the association key; the user's preference score for the third-level commodity category is multiplied by the label weight score to give an expansion label score; and the user's score for each expansion label is calculated with the user and the label as a combined key.
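The expansion step can be illustrated with the following sketch. It assumes the label weights are available as a mapping from category to per-label weight, and that scores falling on the same (user, label) key are summed; the text only states that the user and label form a combined key, so the summation is an assumption, as are all names.

```python
from collections import defaultdict


def expansion_label_scores(user_category_prefs, weights):
    """user_category_prefs: (user, category, score); weights: {category: {tag: weight}}."""
    scores = defaultdict(float)
    for user, category, category_score in user_category_prefs:
        for tag, weight in weights.get(category, {}).items():
            # Category preference score multiplied by the label weight score,
            # aggregated (assumed: summed) per (user, label) key.
            scores[(user, tag)] += category_score * weight
    return dict(scores)


# Hypothetical user with category preferences but no label preferences.
prefs = [("u3", "phones", 0.5), ("u3", "books", 0.25)]
weights = {"phones": {"tech": 0.75, "reading": 0.25}, "books": {"reading": 1.0}}
expanded = expansion_label_scores(prefs, weights)
```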

By using the user's preference for third-level commodity categories to correct the user's preference for labels, the user's preference can be acquired more accurately.

In an exemplary embodiment of the present disclosure, the method further comprises:

Step S114, selecting recommended content according to the preference probability;

Step S116, obtaining click data of the user on the recommended content, and correcting the preference probability according to the click data.

After the user's preference for the tags is obtained, the content under the tags with the highest preference probability for each user can be recommended to that user. The criterion for selecting tags may be that the number of selected tags is less than or equal to a threshold, that the preference probability is greater than a threshold, or that the preference score is greater than a threshold. The present disclosure is not limited thereto.
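The selection criteria listed above might be combined as in the sketch below. The parameter names max_tags and min_prob are hypothetical, and the threshold values are placeholders rather than values from the disclosure.

```python
from typing import Dict, List, Optional


def select_tags(
    prefs: Dict[str, float],
    max_tags: Optional[int] = None,
    min_prob: Optional[float] = None,
) -> List[str]:
    """Return tags ordered by descending preference, filtered by the given criteria."""
    ranked = sorted(prefs.items(), key=lambda item: item[1], reverse=True)
    if min_prob is not None:
        # Keep only tags whose preference probability exceeds the threshold.
        ranked = [(tag, p) for tag, p in ranked if p > min_prob]
    if max_tags is not None:
        # Keep at most max_tags tags.
        ranked = ranked[:max_tags]
    return [tag for tag, _ in ranked]


prefs = {"sports": 0.2, "tech": 0.5, "travel": 0.3}
top_two = select_tags(prefs, max_tags=2)
```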

The user's clicks on recommended content may be recorded by tag. In some embodiments, a click on the recommended content may be marked as '1' and the absence of a click as '0'; in some embodiments, the number of clicks may also be recorded when the user clicks on the recommended content multiple times.
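One possible way to record this is sketched below, assuming clicks are logged as a list of tag names, one entry per click. Both the function name and the log format are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def click_labels(
    recommended: Iterable[str], clicks: Iterable[str]
) -> Dict[str, Tuple[int, int]]:
    """Return {tag: (clicked flag 1/0, number of clicks)} for each recommended tag."""
    counts = Counter(clicks)
    return {tag: (1 if counts[tag] > 0 else 0, counts[tag]) for tag in recommended}


# Hypothetical log: the user clicked "tech" content twice and ignored "travel".
labels = click_labels(["tech", "travel"], ["tech", "tech"])
```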

By acquiring the click of the user on the recommended content, the gcForest model can be trained to learn in a direction more conforming to the actual preference of the user, and the learned model is used for predicting new data, namely, the purpose of correcting the preference probability of the user through the click rate is achieved.

By combining with the online service, collecting statistics on whether the recommended content leads to real click behavior by the user, and training the model accordingly, the PV (page views) and UV (unique visitors) of the online service can be improved.

The traditional logistic regression algorithm relies on the business experience of a data analyst to balance the feature coefficients and is limited to statistical analysis of partial data sampled from small samples. By comparison, the gcForest algorithm can process massive data, can reprocess complex features, and can discover interactions among features; the model is easy to train, more interpretable than a deep neural network, able to output more accurate judgment results, and better suited to complex business scenarios. In the present disclosure, the user preference probability is analyzed using the improved gcForest and corrected in combination with specific services, so that user preference is analyzed more accurately, user experience is optimized, and more benefit is brought to online services.

Corresponding to the method embodiment, the present disclosure also provides a user preference analysis apparatus based on big data, which can be used to execute the method embodiment.

Referring to fig. 11, the big-data-based user preference analysis apparatus 1100 includes:

a data obtaining module 1102, configured to obtain interaction behavior data between a user and content, where the content has at least one tag;

a feature preprocessing module 1104, configured to preprocess the interaction behavior data and generate a feature data set, where the feature data set is used as an input feature value of the gcForest model;

a cascade forest module 1106, configured to use the class probability vector output by each level of cascade forest in the gcForest model and the features of the feature data set as input features of the next level of cascade forest; and

a preference calculating module 1108, configured to obtain a preference probability of the user for the tag according to the class probability vector output by the last level of cascade forest of the gcForest model.

In an exemplary embodiment of the present disclosure, the feature preprocessing module includes:

a missing value processing unit 11042, configured to determine whether missing data exists in the interaction behavior data and, if so, supplement the missing data;

an abnormal value processing unit 11044, configured to delete the maximum value and the minimum value of a preset range in the interaction behavior data; and

a normalization processing unit 11046, configured to perform feature normalization processing on the interaction behavior data.
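The three units can be illustrated on a single feature column as follows. The concrete choices are assumptions, since the text does not fix them: missing values are filled with the column median, abnormal values are taken to be those outside a preset [low, high] range, and normalization is min-max scaling to [0, 1]. In practice an abnormal value would remove the whole instance (row), not just one cell.

```python
from statistics import median
from typing import List, Optional, Sequence


def preprocess_column(
    values: Sequence[Optional[float]], low: float, high: float
) -> List[float]:
    """Fill missing values, drop out-of-range values, then min-max normalize."""
    present = [v for v in values if v is not None]
    if not present:
        return []
    fill = median(present)  # assumed fill strategy
    filled = [fill if v is None else v for v in values]
    kept = [v for v in filled if low <= v <= high]  # assumed outlier rule
    if not kept:
        return []
    lo, hi = min(kept), max(kept)
    if hi == lo:
        return [0.0] * len(kept)
    return [(v - lo) / (hi - lo) for v in kept]


# Hypothetical browsing counts with one missing value and one abnormal value.
normalized = preprocess_column([2.0, None, 4.0, 1000.0, 6.0], low=0.0, high=100.0)
```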

In an exemplary embodiment of the present disclosure, the feature preprocessing module further includes:

a feature adding unit 11048, configured to add a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.

In an exemplary embodiment of the present disclosure, the apparatus further includes:

a real object preference correction module 1110, configured to obtain real object category preference data of a user, and correct the preference probability of the user for the tag according to the real object category preference data.

In an exemplary embodiment of the present disclosure, the apparatus further includes:

a click rate correction module 1112, configured to select recommended content according to the preference probability, obtain click data of the user on the recommended content, and correct the preference probability according to the click data.

Since the functions of the apparatus 1100 have been described in detail in the corresponding method embodiments, they are not repeated herein.

According to an aspect of the present disclosure, there is provided a big data-based user preference analysis apparatus including:

a memory; and

a processor coupled to the memory, the processor configured to perform the method of any of the above based on instructions stored in the memory.

The specific manner in which the processor of the apparatus in this embodiment performs operations has been described in detail in the embodiment related to the big data based user preference analysis method, and will not be elaborated herein.

Fig. 12 is a block diagram illustrating an apparatus 1200 according to an example embodiment. The apparatus 1200 may be a mobile terminal such as a smart phone or a tablet computer.

Referring to fig. 12, the apparatus 1200 may include one or more of the following components: a processing component 1202, a memory 1204, a power component 1206, a multimedia component 1208, an audio component 1210, a sensor component 1214, and a communications component 1216.

The processing component 1202 generally controls overall operation of the apparatus 1200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1202 may include one or more processors 1218 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1202 can include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 can include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.

The memory 1204 is configured to store various types of data to support operation at the apparatus 1200. Examples of such data include instructions for any application or method operating on the apparatus 1200. The memory 1204 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. Also stored in memory 1204 are one or more modules configured to be executed by the one or more processors 1218 to perform all or a portion of the steps of any of the illustrated methods described above.

The power component 1206 provides power to the various components of the apparatus 1200. The power component 1206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 1200.

The multimedia component 1208 includes a screen that provides an output interface between the apparatus 1200 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.

The audio component 1210 is configured to output and/or input audio signals. For example, the audio component 1210 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1200 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 1204 or transmitted via the communication component 1216. In some embodiments, the audio component 1210 further includes a speaker for outputting audio signals.

The sensor component 1214 includes one or more sensors for providing status assessments of various aspects of the apparatus 1200. For example, the sensor component 1214 may detect the open/closed state of the apparatus 1200 and the relative positioning of its components; the sensor component 1214 may also detect a change in position of the apparatus 1200 or of a component of the apparatus 1200, and a change in temperature of the apparatus 1200. In some embodiments, the sensor component 1214 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.

The communications component 1216 is configured to facilitate communications between the apparatus 1200 and other devices in a wired or wireless manner. The apparatus 1200 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1216 receives the broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 1216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 1200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having a program stored thereon, the program, when executed by a processor, implementing the big-data-based user preference analysis method according to any one of the above. The computer-readable storage medium may be, for example, a transitory or non-transitory computer-readable storage medium including instructions.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A big data-based user preference analysis method is characterized by comprising the following steps:

acquiring interactive behavior data of a user and content, wherein the content is provided with at least one label;

preprocessing the interactive behavior data to generate a characteristic data set, and taking the characteristic data set as an input characteristic value of a gcForest model;

taking the interactive combination characteristics output by each level of cascade forests in the gcForest model, the preference probability characteristics of leaf nodes and the characteristics of the characteristic data set as the input characteristics of the next level of cascade forests;

acquiring preference probability of a user on the label according to a class probability vector output by the last layer of cascade forest of the gcForest model;

and acquiring the object class preference data of the user, and correcting the preference probability of the user to the label according to the object class preference data.

2. The user preference analysis method according to claim 1, wherein the interactive behavior data includes data of user's operation on the content within a preset time period, and the data includes browsing number, praise number, share number, comment number, number of viewing details, and next number.

3. The user preference analysis method of claim 1, wherein preprocessing the interaction behavior data comprises:

judging whether the interactive behavior data has missing data or not, and if so, supplementing the missing data;

deleting a maximum value and a minimum value of a preset range in the interactive behavior data;

and performing characteristic normalization processing on the interactive behavior data.

4. The user preference analysis method of claim 1, wherein preprocessing the interaction behavior data further comprises:

and adding a list of characteristic values according to the interactive behavior data and the operation of the user on the content on the day before the current time.

5. The user preference analysis method of claim 1, further comprising:

selecting recommended content according to the preference probability;

and acquiring click data of the user on the recommended content, and correcting the preference probability according to the click data.

6. A big data based user preference analysis apparatus, comprising:

the data acquisition module is used for acquiring interactive behavior data of a user and content, and the content is provided with at least one label;

the characteristic preprocessing module is used for preprocessing the interactive behavior data and generating a characteristic data set, and the characteristic data set is used as an input characteristic value of the gcForest model;

the cascade forest module is used for taking the interactive combination characteristics output by each level of cascade forest in the gcForest model, the preference probability characteristics of leaf nodes and the characteristics of the characteristic data set as the input characteristics of the next level of cascade forest;

the preference calculation module is used for acquiring the preference probability of the user on the label according to the class probability vector output by the last layer of cascade forest of the gcForest model;

and the real object preference correction module is used for acquiring real object type preference data of a user and correcting the preference probability of the user to the label according to the real object type preference data.

7. The apparatus according to claim 6, wherein the interactive behavior data includes data of user operation on the content within a preset time period, and the data includes browsing number, praise number, share number, comment number, number of viewing details, and next number.

8. The apparatus of claim 6, wherein the feature preprocessing module comprises:

the missing value processing unit is used for judging whether the interactive behavior data has missing data or not, and if so, supplementing the missing data;

the abnormal value processing unit is used for deleting the maximum value and the minimum value of a preset range in the interactive behavior data;

and the normalization processing unit is used for carrying out characteristic normalization processing on the interactive behavior data.

9. The apparatus of claim 6, wherein the feature preprocessing module further comprises:

and the characteristic increasing unit is used for increasing a list of characteristic values according to the interactive behavior data and the operation of the user on the content in the day before the current time.

10. The apparatus of claim 6, further comprising:

and the click rate correction module is used for selecting recommended contents according to the preference probability, acquiring click data of the recommended contents from the user and correcting the preference probability according to the click data.

11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 5.