CN107590224A - User preference analysis method and device based on big data - Google Patents
User preference analysis method and device based on big data
- Publication number
- CN107590224A (application CN201710786530.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- user
- preference
- interaction behavior
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure provides a user preference analysis method and device based on big data. The method includes: obtaining interaction behavior data between a user and content, the content having at least one label; preprocessing the interaction behavior data to generate a feature data set, and using the feature data set as the input features of a gcForest model; using the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level; and obtaining the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model. The user preference analysis method provided by the disclosure can deliver more accurate user preference analysis results from big data samples.
Description
Technical field
The present disclosure relates to the field of machine learning, and in particular to a user preference analysis method and device based on big data.
Background
With the development of Internet technology, personalized content recommendation for users has become increasingly common. Taking article recommendation as an example, by assigning one or more labels to each article according to its content and recording users' operations on articles, it is possible to determine which labels a user prefers, and then to recommend other articles under those labels and improve the user experience.
In existing personalized recommendation technology, user preferences are mainly analyzed in two ways: analysis based on the logistic regression (LR) algorithm, and statistical scoring formulas in which an analyst weights each feature by time. In LR-based analysis, a data analyst has to decide from business experience which features to extract and how to label the content. Once the feature and label data are available, stratified sampling is performed over the different labels, and the logistic regression model of a statistical analysis package is used to obtain the coefficient of each feature, which determines the scoring formula for a user's label preference. Time-weighted statistical scoring assumes that a user prefers content selected recently over content selected somewhat longer ago, and therefore maintains a table of time weights: a suitable function is chosen to determine the weight of each of the 365 days of the year, and the features are then combined into a statistical formula with a time dimension.
In the techniques above, LR-based analysis requires an analyst to determine the coefficient of each feature based on business experience. It depends heavily on the analyst's experience, every business line has to be analyzed by hand, efficiency is low, and the sample size is small. Moreover, because a user's content preferences differ from one period to another, it is hard to find the most suitable time weight function, so time-weighted statistical scoring also struggles to capture user preferences accurately.
Therefore, a user preference analysis method that can process large numbers of samples and provide more accurate analysis results is of great significance for improving personalized recommendation capability and increasing content click volume.
It should be noted that the information disclosed in the Background section above is intended only to aid understanding of the background of the disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the invention
The purpose of the disclosure is to provide a user preference analysis method and device based on big data that overcome, at least to some extent, one or more problems caused by the limitations and defects of the related art.
According to a first aspect of the embodiments of the disclosure, a user preference analysis method based on big data is provided, including: obtaining interaction behavior data between a user and content, the content having at least one label; preprocessing the interaction behavior data to generate a feature data set, and using the feature data set as the input features of a gcForest model; using the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level; and obtaining the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.
In an exemplary embodiment of the disclosure, the interaction behavior data includes data on the user's operations on the content within a preset time period, including the number of views, likes, shares, comments, detail views, and orders placed.
In an exemplary embodiment of the disclosure, preprocessing the interaction behavior data includes: determining whether the interaction behavior data contains missing data and, if so, filling in the missing data; deleting the maximum and minimum values within a preset range of the interaction behavior data; and applying feature normalization to the interaction behavior data.
In an exemplary embodiment of the disclosure, preprocessing the interaction behavior data further includes: adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
In an exemplary embodiment of the disclosure, the method further includes: obtaining the user's physical-goods category preference data; and correcting the user's preference probability for the label according to the physical-goods category preference data.
In an exemplary embodiment of the disclosure, the method further includes: selecting recommended content according to the preference probability; and obtaining the user's click data for the recommended content and correcting the preference probability according to the click data.
According to a second aspect of the disclosure, a user preference analysis device based on big data is provided, including: a data acquisition module for obtaining interaction behavior data between a user and content, the content having at least one label; a feature preprocessing module for preprocessing the interaction behavior data to generate a feature data set and using the feature data set as the input features of a gcForest model; a cascade forest module for using the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level; and a preference calculation module for obtaining the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.
In an exemplary embodiment of the disclosure, the interaction behavior data includes data on the user's operations on the content within a preset time period, including the number of views, likes, shares, comments, detail views, and orders placed.
In an exemplary embodiment of the disclosure, the feature preprocessing module includes: a missing value processing unit for determining whether the interaction behavior data contains missing data and, if so, filling in the missing data; an outlier processing unit for deleting the maximum and minimum values within a preset range of the interaction behavior data; and a normalization unit for applying feature normalization to the interaction behavior data.
In an exemplary embodiment of the disclosure, the feature preprocessing module further includes: a feature adding unit for adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
In an exemplary embodiment of the disclosure, the device further includes: a physical-goods preference correction module for obtaining the user's physical-goods category preference data and correcting the user's preference probability for the label according to that data.
In an exemplary embodiment of the disclosure, the device further includes: a click-through rate correction module for selecting recommended content according to the preference probability, obtaining the user's click data for the recommended content, and correcting the preference probability according to the click data.
According to a third aspect of the disclosure, a computer-readable storage medium is provided on which a computer program is stored; when executed by a processor, the program implements the method steps of any one of the above.
The present invention uses an improved multi-grained cascade forest algorithm (gcForest) to process big data samples in a distributed manner and analyzes the user's preference for content labels from the output. Because richer data can be used, it can obtain more accurate user preference analysis results, make personalized recommendation more efficient, and improve the user experience.
It should be understood that the general description above and the detailed description below are only exemplary and explanatory and do not limit the disclosure.
Brief description of the drawings
The accompanying drawings are incorporated into and form part of this specification. They show embodiments consistent with the disclosure and, together with the specification, serve to explain its principles. The drawings described below are only some embodiments of the disclosure; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 schematically shows a flowchart of the user preference analysis method based on big data in an exemplary embodiment of the disclosure.
Fig. 2 is a schematic diagram of the interaction behavior data in an exemplary embodiment of the disclosure.
Fig. 3 is a flowchart of preprocessing the interaction behavior data in an exemplary embodiment of the disclosure.
Fig. 4 is a flowchart of handling missing values in the interaction behavior data in an exemplary embodiment of the disclosure.
Fig. 5 is a schematic diagram of the data table obtained after preprocessing the interaction behavior data in an exemplary embodiment of the disclosure.
Fig. 6 is a schematic diagram of the multi-grained cascade forest (gcForest) structure.
Fig. 7 is a schematic diagram of class probability vector generation in the cascade forest.
Fig. 8 is a schematic diagram of the improvement made to the gcForest algorithm in an exemplary embodiment of the disclosure.
Fig. 9 is the data table of users' preference probabilities for labels output in an exemplary embodiment of the disclosure.
Fig. 10 is a flowchart of a user preference analysis method in an exemplary embodiment of the disclosure.
Fig. 11 schematically shows a block diagram of a user preference analysis device based on big data in an exemplary embodiment of the disclosure.
Fig. 12 schematically shows a block diagram of another user preference analysis device based on big data in an exemplary embodiment of the disclosure.
Detailed description
Example embodiments are now described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth here; rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the disclosure. Those skilled in the art will appreciate, however, that the technical solution of the disclosure may be practiced with one or more of the specific details omitted, or with other methods, components, devices, steps, and so on. In other cases, well-known solutions are not shown or described in detail, to avoid distracting from and obscuring aspects of the disclosure.
In addition, the drawings are only schematic illustrations of the disclosure. Identical reference numerals in the drawings denote identical or similar parts, so repeated descriptions of them are omitted. Some of the block diagrams in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Example embodiments of the disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 schematically shows a flowchart of the user preference analysis method based on big data in an exemplary embodiment of the disclosure. With reference to Fig. 1, the user preference analysis method 100 based on big data includes:
Step S102: obtain interaction behavior data between a user and content, the content having at least one label.
Step S104: preprocess the interaction behavior data to generate a feature data set, and use the feature data set as the input features of a gcForest model.
Step S106: use the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level.
Step S108: obtain the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.
By processing massive samples in a distributed manner with the improved multi-grained cascade forest algorithm (gcForest) and analyzing the user's preference for content labels from the output, the method can use richer data to obtain more accurate user preference analysis results, make personalized recommendation more efficient, and improve the user experience.
Each step of method 100 is described in detail below.
In step S102, interaction behavior data between a user and content is obtained, the content having at least one label.
"Content" in the disclosure includes, but is not limited to, articles, goods, music, videos, books, or any other content that can be recommended to a user. For ease of description, the disclosure uses article recommendation as the only example; those skilled in the art can apply the method to personalized recommendation of other content.
Fig. 2 is a schematic diagram of the interaction behavior data in an exemplary embodiment of the disclosure. With reference to Fig. 2, the interaction behavior data may include data on the user's operations on the content within a preset time period, including the number of views, likes, shares, comments, detail views, and orders placed.
Specifically, in JD.com's Discover article channel, about 13,000,000 users acted directly on article content within 90 days. This volume meets the data requirements of the analysis, so the preset time period may be set to 90 days and the interaction behavior data may be the user behavior within 90 days. In some embodiments, however, behaviors such as likes and shares may be analyzed using the user's operation data within 30 days.
The following six features can be extracted with HIVE from the user behavior recorded in the database:
Feature 1: the user's view score feature for the label over the past 90 days;
Feature 2: the user's like score feature for the label over the past 90 days;
Feature 3: the user's share score feature for the label over the past 30 days;
Feature 4: the user's comment score feature for the label over the past 30 days;
Feature 5: the user's score feature for clicking product details on the label page over the past 30 days;
Feature 6: the user's order score feature for orders generated through the label.
HIVE is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL query capability, which makes it well suited to statistical analysis. Each score feature extracted above includes the count of the corresponding user behavior within the preset time period. When the score features are processed, a feature value may be set for each data item through data preprocessing, according to the specific business behind each feature dimension. In some embodiments, feature values may also be set according to a specific weight for each user behavior; the feature value may be, for example, the behavior count or a weighted value of the behavior count. The method of setting feature values may be chosen by those skilled in the art according to actual conditions, and the disclosure places no particular limit on it.
The disclosure computes, for each of a large number of candidate features, its relevance and importance to the target prediction variable (the user's click conversion rate), and selects the six most valuable features above according to information gain. These features can cover the analysis of more than 90% of user behavior information.
As shown in Fig. 2, in the extracted feature table the first column is the user name, the second column is the content label, and the third to eighth columns are the values of the six features extracted with HIVE.
In step S104, the interaction behavior data is preprocessed to generate a feature data set, and the feature data set is used as the input features of the gcForest model.
Fig. 3 is a flowchart of preprocessing the interaction behavior data in an exemplary embodiment of the disclosure. With reference to Fig. 3, the preprocessing flow may include:
Step S302: determine whether the interaction behavior data contains missing data and, if so, fill in the missing data.
Step S304: delete the maximum and minimum values within a preset range of the interaction behavior data.
Step S306: apply feature normalization to the interaction behavior data.
Fig. 4 is a flowchart of handling missing values in the interaction behavior data in an exemplary embodiment of the disclosure.
With reference to Fig. 4, in step S302, after the data is obtained, step S3020 first determines whether there are missing values. If there are none, the flow proceeds to step S304; if there are, it proceeds to step S3021 to determine whether the missing values are meaningful. "Meaningful" means that the feature data plays a key role in predicting the target. For example, if only one of a user's six feature values is missing, the user can be judged to be an active user and the missing feature does not affect prediction of that user's behavior, so the missing value can be judged not meaningful. In some embodiments, this judgment may be based on the ratio of missing values to existing values: when the ratio is below a threshold, the missing values are judged not meaningful; when the ratio is greater than or equal to the threshold, the missing values are considered important, that is, meaningful.
When the missing value is meaningful and of categorical type, step S3022 creates a new category for the missing value. When the missing value is meaningful and numeric, step S3023 sets it to a reasonable value, for example the mean or median of all values of that feature. When the missing value is not meaningful, step S3024 examines the specific pattern of missing data. If little data is missing overall (less than a threshold), step S3025 deletes the instances with missing data. If the overall data is ordered in time, step S3026 replaces the missing value with a value from data earlier in time than the overall data (for example, the data closest in time to the earliest time of the overall data). Otherwise, step S3027 determines whether the overall data follows a simple distribution. If it does not, step S3028 uses a simple machine learning model to produce a substitute for the missing value. If it follows a simple distribution and has no outliers, step S3029 substitutes the mean of the column. If it follows a simple distribution and has outliers, step S3030 substitutes the median of the column. After any of steps S3028 to S3030, the flow proceeds to step S304. An outlier is a value in a set of values whose deviation from the mean exceeds two standard deviations.
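The choice between mean and median in steps S3029 and S3030 can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `has_outlier` and `fill_missing` are hypothetical, missing values are represented as `None`, and the population standard deviation is used, which the text does not specify.

```python
from statistics import mean, median, pstdev


def has_outlier(values):
    """Return True if any value deviates from the mean by more than
    two standard deviations (population std is an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    return any(abs(v - mu) > 2 * sigma for v in values)


def fill_missing(column):
    """Replace None entries with the column mean, or with the column
    median when the existing values contain an outlier."""
    present = [v for v in column if v is not None]
    substitute = median(present) if has_outlier(present) else mean(present)
    return [substitute if v is None else v for v in column]
```

The median is the safer substitute when outliers are present because a single extreme value shifts the mean far more than it shifts the median.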
In step S304, the data distribution of each feature is analyzed and the top x% and bottom y% of the data are deleted, which removes the outliers in the data. Here x and y are natural numbers between 1 and 99 and may be equal or different. In an exemplary embodiment of the disclosure, x and y may both be 5, for example, so that outliers are removed by deleting the largest 5% and the smallest 5% of the data.
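The trimming in step S304 can be sketched as below. The function name `trim_extremes` is hypothetical, and rounding the number of dropped items down is an assumption; the text only gives the percentages.

```python
def trim_extremes(values, top_pct=5, bottom_pct=5):
    """Drop the largest top_pct percent and the smallest bottom_pct
    percent of the values; returns the remaining values in sorted order."""
    ordered = sorted(values)
    n = len(ordered)
    drop_low = n * bottom_pct // 100   # count of smallest values to drop
    drop_high = n * top_pct // 100     # count of largest values to drop
    return ordered[drop_low:n - drop_high]
```

With 100 values and the default 5% on each side, the five smallest and five largest values are removed and 90 remain.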
In step S306, the data of each feature may be normalized. The normalization formula may be, for example:
$$y_i^{new} = \frac{y_i - y_{min}}{y_{max} - y_{min}}$$
where $y_i^{new}$ is the feature value after normalization, $y_i$ is the original value, $y_{min}$ is the minimum of all existing values of the feature, and $y_{max}$ is the maximum of all existing values of the feature. After normalization, the feature data lies between 0 and 1.
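Assuming the normalization is the usual min-max scaling implied by the definitions of the minimum and maximum above, a minimal sketch looks like this. The function name is hypothetical, and returning zeros for a constant column is an assumption made to avoid division by zero; the text does not cover that case.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale by (assumed handling).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```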
In an exemplary embodiment of the disclosure, preprocessing the interaction behavior data may further include step S308: adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
In addition to the user's behavior within the preset time period, it is possible to check whether the user clicked yesterday on content carrying a label the user had previously operated on. If the user did click yesterday on content with such a label, the feature value "1" is added to the interaction behavior data of that user and label; if the user did not, the feature value "0" is added. The added feature values form a new feature column.
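The extra column of step S308 can be sketched as follows. The row layout `(user, label, features)`, the set of `(user, label)` pairs clicked yesterday, and the function name are all assumptions for illustration.

```python
def add_yesterday_click_feature(rows, clicked_yesterday):
    """Append a 0/1 feature to each (user, label, features) row.

    clicked_yesterday is a set of (user, label) pairs for which the user
    clicked content with that label on the previous day.
    """
    result = []
    for user, label, features in rows:
        flag = 1 if (user, label) in clicked_yesterday else 0
        # Build a new list so the input rows are left unchanged.
        result.append((user, label, features + [flag]))
    return result
```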
After the data is preprocessed, the data of the training data set can be constructed on the big data platform Spark. Fig. 5 is a schematic diagram of the data table obtained after preprocessing the interaction behavior data in an exemplary embodiment of the disclosure. With reference to Fig. 5, the first column of the data table is the added feature value, and each subsequent column is, in order, a pair of the form feature number:feature value.
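A row in the layout described for Fig. 5 could be produced as in the sketch below. The space separator, the 1-based feature numbering, and the function name are assumptions; the text only states that the first column is the added feature value and the remaining columns are feature number:feature value pairs.

```python
def format_row(added_value, features):
    """Render one training row: the added feature value first, then
    'number:value' pairs with 1-based feature numbers."""
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1)]
    return " ".join([str(added_value)] + pairs)
```

For example, `format_row(1, [0.2, 0.0, 0.5])` yields a row beginning with the added value 1 followed by three numbered feature entries.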
Preprocessing the acquired interaction behavior data provides a more effective and more accurate data source for the subsequent analysis.
Next, the preprocessed data is used as the input data of the machine learning model.
It is worth noting that the machine learning model must be trained on a data set before it is applied to test data. The trained model can be used on data sets for testing, including the training data set. In some embodiments of the disclosure, the data set used for testing may include preprocessed current user behavior data obtained from an online data stream.
In an exemplary embodiment of the disclosure, the gcForest algorithm is chosen as the machine learning algorithm for analyzing user preference. gcForest (multi-grained cascade forest) is a decision tree ensemble method with a multi-grained cascade structure. Whereas feature learning in deep neural networks relies on layer-by-layer processing of raw features, gcForest uses a cascade structure in which multiple forests of decision trees perform feature learning. Multi-grained scanning of the input strengthens the feature learning ability of the cascade forest. Compared with traditional logistic regression, gcForest can extract features more effectively, is better suited to precise personalized recommendation on big data, is better suited to parallel deployment, and has advantages such as simple theoretical analysis and few tuning parameters.
Fig. 6 is a schematic diagram of the multi-grained cascade forest (gcForest) structure. With reference to Fig. 6, each level of the cascade forest receives the feature information processed by the previous level and outputs its own result to the next level. Each cascade level contains two random forests and two completely random forests; each completely random forest contains 1000 completely random trees and each random forest contains 1000 random trees. With the cascade structure, gcForest divides model training into two stages: a feature generation stage and a result output stage. In the feature generation stage, each completely random tree in a completely random forest randomly selects one feature to split on at each node, and the tree keeps growing until every leaf node contains only instances of the same class or no more than 10 instances. By contrast, a random tree in a random forest selects a number of candidate features equal to the square root of the number of features, and splits on the candidate with the best Gini value. Assuming there are n classes to predict, each forest outputs an n-dimensional class probability vector; these vectors are then concatenated into interaction combination features that serve as the input data of the next level's forests.
Fig. 7 is a schematic diagram of class probability vector generation in the cascade forest. With reference to Fig. 7, the different marks in the leaf nodes represent different classes. When a new user instance enters the gcForest model, each forest computes the percentage of samples of each class at the leaf node into which the instance falls, and then averages over all trees in the forest to produce an estimate of the class distribution; that is, each forest outputs one class probability vector. To reduce the risk of overfitting, the class probability vector produced by each forest is generated by K-fold cross-validation.
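The averaging described for Fig. 7 can be sketched as follows, leaving out tree construction and K-fold cross-validation. The input format (one list of per-class sample counts per tree, taken at the leaf the instance falls into) and the function name are assumptions for illustration.

```python
def forest_class_vector(leaf_counts_per_tree):
    """Average the per-tree class proportions at the leaf an instance
    falls into, giving one class probability vector for the forest."""
    n_classes = len(leaf_counts_per_tree[0])
    totals = [0.0] * n_classes
    for counts in leaf_counts_per_tree:
        leaf_size = sum(counts)
        for k, count in enumerate(counts):
            totals[k] += count / leaf_size  # class proportion in this leaf
    n_trees = len(leaf_counts_per_tree)
    return [t / n_trees for t in totals]
```

Because each tree contributes proportions that sum to 1, the averaged vector also sums to 1.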
In step S106, the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, is used as the input features of the next cascade forest level.
Fig. 8 is a schematic diagram of the improvement made to the gcForest algorithm in an exemplary embodiment of the disclosure. With reference to Fig. 8, the instances extracted by sliding feature windows of the same size are used to train a completely random forest and a random forest; the trained forests generate class probability vectors, and the class probability vectors are concatenated as the transformed features. The existing gcForest algorithm constructs the input of the next cascade level by combining the output of the previous cascade level with the original features. The present invention improves the construction of the new input features. Except for the first cascade level, whose input is the 7 preprocessed features, the input of every other cascade level consists of the original 7 features, the interaction combination features output by the previous cascade level, and the preference probability features of the leaf nodes output by the previous cascade level. In other words, the preference probability predicted by the previous level is used as a new input feature of the next level. Specifically, the preference probability predicted by the previous level is captured by the sliding window as a new feature, together with the original 7 features and the interaction combination features output by the previous level, and is added as a splitting feature to the splitting features of the next level's forests. Adding features to each cascade level in this way can improve the accuracy of the gcForest algorithm.
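The input construction for levels after the first can be sketched as a simple concatenation. This is only an illustration: the function and parameter names are hypothetical, sliding-window scanning is omitted, and how the preference probabilities are derived from the previous level is left to the caller because the text does not fix it.

```python
def next_level_input(original_features, forest_vectors, preference_probs):
    """Concatenate the original features, the class vectors output by the
    previous level's forests, and the previous level's preference
    probabilities into one input feature list."""
    combined = list(original_features)
    for vector in forest_vectors:
        combined.extend(vector)
    combined.extend(preference_probs)
    return combined
```

With 7 original features, four forests each emitting a 3-class vector, and 3 preference probabilities, the next level receives 7 + 12 + 3 = 22 input features.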
In step S108, the user's preference probability for the label is obtained from the class probability vector output by the last cascade forest level of the gcForest model.
Referring again to Fig. 7, the class probability vector output by the last cascade forest level of the gcForest model has the form {a, b, c, d, ...}. The number of elements in the vector equals the number of labels involved, the elements sum to 1, and each element is the user's preference probability for one label. From the class probability vectors of the instances of several users, the preference scores of those users for their preferred labels can be obtained. The calculation of the preference score may be set by those skilled in the art according to actual conditions, as long as it is based on the user's preference probability for the label.
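Mapping the final vector to labels can be sketched as below. Averaging the class vectors of the last level's forests is the convention of the standard gcForest algorithm and is assumed here; the function name and the label list are hypothetical.

```python
def label_preferences(labels, last_level_vectors):
    """Average the last level's class probability vectors element-wise
    and pair each element with its label."""
    n_forests = len(last_level_vectors)
    averaged = [sum(column) / n_forests for column in zip(*last_level_vectors)]
    return dict(zip(labels, averaged))
```

Since every input vector sums to 1, the averaged preference probabilities also sum to 1 across the labels.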
Fig. 9 is the data table of users' preference probabilities for labels output in an exemplary embodiment of the disclosure. With reference to Fig. 9, the first column of the preference probability data table is the user name, and the second and subsequent columns take the form label:preference score#label:preference score....
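The label:preference score#label:preference score layout can be written and read back as in this sketch. The tab between the user name and the label list, and both function names, are assumptions; only the colon and hash separators come from the text.

```python
def format_preferences(user, prefs):
    """Render a user row as 'user<TAB>label:score#label:score...'."""
    pairs = "#".join(f"{label}:{score}" for label, score in prefs.items())
    return f"{user}\t{pairs}"


def parse_preferences(line):
    """Parse a row produced by format_preferences back into
    (user, {label: score})."""
    user, pairs = line.split("\t", 1)
    prefs = {}
    for pair in pairs.split("#"):
        label, score = pair.rsplit(":", 1)
        prefs[label] = float(score)
    return user, prefs
```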
Fig. 10 is a flowchart of a user preference analysis method in an exemplary embodiment of the disclosure. With reference to Fig. 10, in addition to all the steps of user preference analysis method 100, user preference analysis method 1000 may further include:
Step S1002: obtain the user's physical-goods category preference data.
Step S1004: correct the user's preference probability for the label according to the physical-goods category preference data.
Step S1006: select recommended content according to the preference probability.
Step S1008: obtain the user's click data for the recommended content, and correct the preference probability according to the click data.
When user preference analysis method 1000 is used to analyze users' preferences for article labels on an e-commerce website, the user preferences obtained by the gcForest algorithm can be associated with the users' preferences for physical product categories, and the physical product category preference probabilities can be used to expand and correct the user preference probabilities.
First, the correspondence between third-level product categories and labels can be found, and the label preference scores can be weight-normalized:
(1) using the user as the join key, join the user-category preference data table (third-level product categories) with the user-label preference data table, and denote the join result as TableA;
(2) in TableA, using the label ID and the third-level product category ID as the key, calculate the score of each third-level product category under each label, denoted score;
(3) in TableA, using the label ID as the key, calculate the total preference score under each label, denoted sumScore;
(4) calculate the total preference score of all labels, denoted allScore;
(5) using the third-level product category as the key, calculate the total score under each third-level product category, denoted sum;
(6) calculate a filtering threshold for each label: the ratio of the score under this label to the total score of all labels; this ratio is the label's filtering threshold, denoted tagRatio;
(7) each third-level product category may correspond to multiple label IDs; a label ID is retained if the label's score under that third-level product category, divided by the category's total score, is greater than the filtering threshold tagRatio;
(8) normalize the weight scores: for the label IDs retained for each third-level product category, calculate the label weight score using the following formula:
Normalization diversifies the labels corresponding to a user under a third-level product category.
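Steps (1) to (8) can be sketched in plain Python as follows. Two points are assumptions rather than statements of the disclosure: the score in step (2) is taken to be the sum of the joined label preference scores, and, because the formula of step (8) is not reproduced in the text, the retained scores are simply normalized to sum to 1 within each category. The function and variable names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def label_weights(
    user_category_pref: List[Tuple[str, str, float]],  # (user, category, score)
    user_label_pref: List[Tuple[str, str, float]],     # (user, label, score)
) -> Dict[str, Dict[str, float]]:
    """Return {category: {label: weight}} for the retained labels."""
    # (1) Join the two tables on user -> rows of (category, label, label score).
    labels_by_user = defaultdict(list)
    for user, label, label_score in user_label_pref:
        labels_by_user[user].append((label, label_score))
    table_a = [
        (category, label, label_score)
        for user, category, _category_score in user_category_pref
        for label, label_score in labels_by_user.get(user, [])
    ]

    score = defaultdict(float)      # (2) keyed by (label, category)
    sum_score = defaultdict(float)  # (3) total per label
    cat_sum = defaultdict(float)    # (5) total per category
    for category, label, value in table_a:
        score[(label, category)] += value
        sum_score[label] += value
        cat_sum[category] += value

    all_score = sum(sum_score.values())  # (4) total over all labels
    if all_score == 0:
        return {}

    # (6) Per-label filtering threshold.
    tag_ratio = {label: total / all_score for label, total in sum_score.items()}

    # (7) Keep a label under a category if its share exceeds the threshold.
    retained = defaultdict(dict)
    for (label, category), value in score.items():
        if cat_sum[category] > 0 and value / cat_sum[category] > tag_ratio[label]:
            retained[category][label] = value

    # (8) Assumed normalization: weights sum to 1 within each category.
    weights = {}
    for category, by_label in retained.items():
        total = sum(by_label.values())
        weights[category] = {label: v / total for label, v in by_label.items()}
    return weights
```

The filter in step (7) keeps a label under a category only when the label is over-represented there relative to its overall share, which is what lets different categories map to different labels.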
A user of the e-commerce website may have no preference for article labels but have a preference for the third-level product categories of products. In this case, the labels corresponding to those third-level product categories can be recommended to the user: using the third-level product category as the join key, join the normalized weight scores with the user's third-level product category preference table; multiply the user's third-level product category preference score by the label weight score to obtain the expansion label score; and, using the user and the label as the composite key, calculate the user's score for each expansion label.
Correcting the user's label preferences with the user's preferences for third-level product categories yields the user's preferences more accurately.
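The expansion step can be sketched as below. The input shapes and names are hypothetical, and summing the products per (user, label) pair is an assumption, since the text only says that the score is calculated with the user and the label as the composite key.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def expansion_label_scores(
    user_category_pref: List[Tuple[str, str, float]],  # (user, category, score)
    weights: Dict[str, Dict[str, float]],               # {category: {label: weight}}
) -> Dict[Tuple[str, str], float]:
    """Return {(user, label): expansion label score}."""
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for user, category, category_score in user_category_pref:
        # Join on the third-level product category.
        for label, weight in weights.get(category, {}).items():
            # Category preference score multiplied by the label weight score,
            # accumulated per (user, label) composite key.
            scores[(user, label)] += category_score * weight
    return dict(scores)
```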
In an exemplary embodiment of the present disclosure, the method further includes:
Step S114: select recommended content according to the preference probability;
Step S116: obtain the user's click data on the recommended content, and correct the preference probability according to the click data.
After the user's preferences for labels are obtained, content under the several labels with the highest preference probability for each user can be recommended to that user. The criterion for selecting labels may be that the number of labels is less than or equal to a threshold, that the preference probability is greater than a threshold, or that the preference score is greater than a threshold. The present disclosure places no particular limitation on this.
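The selection criteria can be illustrated with a short sketch. The function name and parameters are hypothetical; the sketch covers the count limit and the probability threshold, and a score threshold would work the same way on scores instead of probabilities.

```python
from typing import Dict, List, Optional


def select_labels(preferences: Dict[str, float],
                  max_labels: Optional[int] = None,
                  min_probability: Optional[float] = None) -> List[str]:
    """Pick the labels whose content will be recommended to one user."""
    # Rank labels from the highest to the lowest preference probability.
    ranked = sorted(preferences.items(), key=lambda item: item[1], reverse=True)
    if min_probability is not None:
        # Keep labels whose probability is greater than the threshold.
        ranked = [(label, p) for label, p in ranked if p > min_probability]
    if max_labels is not None:
        # Keep at most max_labels labels.
        ranked = ranked[:max_labels]
    return [label for label, _ in ranked]
```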
The user's clicks on recommended content can be recorded by marking. In some embodiments, recommended content that the user clicks can be marked '1' and recommended content that the user does not click can be marked '0'; in some embodiments, the number of clicks can also be recorded when the user clicks the recommended content multiple times.
By obtaining the user's clicks on recommended content, the above gcForest model can be trained to learn in a direction that better matches the user's actual preferences, and the learned model is used to predict new data, thereby correcting the user preference probability with the click-through rate.
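The click marking described above might be recorded as in the sketch below, which produces both the '1'/'0' mark and the click count for each recommended item. The data shapes and names are assumptions; how the marks are fed back into training is not specified here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def mark_clicks(recommended: List[str],
                click_log: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Return {item: (mark, clicks)} for every recommended item.

    mark is 1 if the user clicked the item at least once, otherwise 0;
    clicks is the number of times the item was clicked.
    """
    counts = Counter(click_log)
    return {item: (1 if counts[item] > 0 else 0, counts[item])
            for item in recommended}
```

The resulting marks could serve as training targets when the model is refreshed with click feedback.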
By collecting statistics, in combination with the online service, on whether the recommended content actually leads users to click, and training the model on these statistics, the online operating model can gain increases in PV (page views) and UV (unique visitors).
Traditional logistic regression relies on the business experience of data analysts to weigh feature coefficients and is limited to statistical analysis of a sampled portion of the data. By comparison, the gcForest algorithm can process massive data, can reprocess complex features and discover interactions between features, is easy to train, is more interpretable than a deep neural network, can output more accurate results, and is better suited to complex business scenarios. By analyzing user preference probabilities with the improved gcForest and correcting user preferences in combination with the specific business, the present disclosure analyzes user preferences more accurately, improves the user experience, and brings more revenue to the online business.
Corresponding to the above method embodiments, the present disclosure also provides a user preference analysis device based on big data, which can be used to perform the above method embodiments.
Fig. 11 schematically shows a block diagram of a user preference analysis device based on big data in an exemplary embodiment of the present disclosure.
Referring to Fig. 11, the big-data-based user preference analysis device 1100 includes:
A data acquisition module 1102, configured to obtain interaction behavior data between users and content, the content having at least one label.
A feature preprocessing module 1104, configured to preprocess the interaction behavior data and generate a feature data set, the feature data set serving as the input feature values of a gcForest model.
A cascade forest module 1106, configured to use the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as the input features of the next cascade forest level.
A preference calculation module 1108, configured to obtain the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.
In an exemplary embodiment of the present disclosure, the interaction behavior data include data on the user's operations on the content within a preset time period, the data including the number of views, the number of likes, the number of shares, the number of comments, the number of detail views, and the number of orders placed.
In an exemplary embodiment of the present disclosure, the feature preprocessing module includes:
A missing value processing unit 11042, configured to determine whether missing data exist in the interaction behavior data and, if so, to fill in the missing data.
An outlier processing unit 11044, configured to delete the maximum and minimum values of a preset range in the interaction behavior data.
A normalization unit 11046, configured to perform feature normalization on the interaction behavior data.
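The three preprocessing units can be sketched together as follows. The concrete choices are assumptions made for illustration only: missing values are filled with the column mean, a row is dropped when any value falls outside caller-supplied (low, high) bounds, and normalization is min-max scaling to [0, 1]. The disclosure does not fix any of these choices.

```python
from typing import List, Optional, Sequence, Tuple


def preprocess(rows: Sequence[Sequence[Optional[float]]],
               bounds: Sequence[Tuple[float, float]]) -> List[List[float]]:
    """Fill missing values, drop out-of-range rows, then min-max normalize."""
    n_cols = len(bounds)

    # Missing value handling: replace None with the mean of the column.
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    filled = [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
              for row in rows]

    # Outlier handling: keep rows whose values all lie within the bounds.
    kept = [row for row in filled
            if all(low <= row[j] <= high
                   for j, (low, high) in enumerate(bounds))]
    if not kept:
        return []

    # Feature normalization: scale every column to [0, 1].
    mins = [min(row[j] for row in kept) for j in range(n_cols)]
    maxs = [max(row[j] for row in kept) for j in range(n_cols)]
    return [[(row[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
             for j in range(n_cols)]
            for row in kept]
```

Note that a mean computed before outlier removal is itself sensitive to outliers; a median or a different step order may be preferable in practice.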
In an exemplary embodiment of the present disclosure, the feature preprocessing module further includes:
A feature adding unit 11048, configured to add a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
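One way to add such a column is sketched below. The text does not say what the added value is, so the sketch assumes it is the number of operations the user performed on the previous day; the data shapes and names are likewise hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, Tuple


def add_previous_day_feature(
    feature_rows: Dict[str, List[float]],  # {user: existing feature values}
    events: Iterable[Tuple[str, date]],    # (user, day of the operation)
    today: date,
) -> Dict[str, List[float]]:
    """Append the user's operation count for the previous day as a new column."""
    yesterday = today - timedelta(days=1)
    counts: Dict[str, int] = {}
    for user, day in events:
        if day == yesterday:
            counts[user] = counts.get(user, 0) + 1
    # Users without operations on the previous day get 0 in the new column.
    return {user: row + [counts.get(user, 0)]
            for user, row in feature_rows.items()}
```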
In an exemplary embodiment of the present disclosure, the device further includes:
A physical product preference correction module 1110, configured to obtain the user's physical product category preference data and to correct the user's preference probability for the label according to the physical product category preference data.
In an exemplary embodiment of the present disclosure, the device further includes:
A click-through rate correction module 1112, configured to select recommended content according to the preference probability, obtain the user's click data on the recommended content, and correct the preference probability according to the click data.
Since the functions of device 1100 have been described in detail in the corresponding method embodiments, they are not repeated here.
According to one aspect of the present disclosure, a user preference analysis device based on big data is provided, including:
a memory; and
a processor coupled to the memory, the processor being configured to perform the method according to any one of the above based on instructions stored in the memory.
The specific manner in which the processor of the device in this embodiment performs operations has been described in detail in the embodiments of the big-data-based user preference analysis method and is not elaborated here.
Fig. 12 is a block diagram of a device 1200 according to an exemplary embodiment. Device 1200 may be a mobile terminal such as a smartphone or a tablet computer.
Referring to Fig. 12, device 1200 may include one or more of the following components: a processing component 1202, a memory 1204, a power supply component 1206, a multimedia component 1208, an audio component 1210, a sensor component 1214, and a communication component 1216.
The processing component 1202 generally controls the overall operation of device 1200, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 1202 may include one or more processors 1218 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 1202 may include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 may include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.
The memory 1204 is configured to store various types of data to support the operation of device 1200. Examples of such data include instructions for any application or method operated on device 1200. The memory 1204 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. The memory 1204 also stores one or more modules configured to be executed by the one or more processors 1218 to complete all or part of the steps of any of the methods shown above.
The power supply component 1206 provides power to the various components of device 1200. The power supply component 1206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 1200.
The multimedia component 1208 includes a screen that provides an output interface between device 1200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.
The audio component 1210 is configured to output and/or input audio signals. For example, the audio component 1210 includes a microphone (MIC), which is configured to receive external audio signals when device 1200 is in an operating mode such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 1204 or sent via the communication component 1216. In some embodiments, the audio component 1210 also includes a speaker for outputting audio signals.
The sensor component 1214 includes one or more sensors for providing status assessments of various aspects of device 1200. For example, the sensor component 1214 can detect the open/closed state of device 1200 and the relative positioning of components; the sensor component 1214 can also detect a change in position of device 1200 or of one of its components, and a change in temperature of device 1200. In some embodiments, the sensor component 1214 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1216 is configured to facilitate wired or wireless communication between device 1200 and other devices. Device 1200 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1216 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1216 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, device 1200 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program is stored; when executed by a processor, the program implements the big-data-based user preference analysis method according to any one of the above. The computer-readable storage medium may be, for example, a transitory or non-transitory computer-readable storage medium including instructions.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the claims.
Claims (13)
- 1. A user preference analysis method based on big data, characterized by comprising: obtaining interaction behavior data between a user and content, the content having at least one label; preprocessing the interaction behavior data to generate a feature data set, and using the feature data set as input feature values of a gcForest model; using the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as input features of the next cascade forest level; and obtaining the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.
- 2. The user preference analysis method according to claim 1, characterized in that the interaction behavior data include data on the user's operations on the content within a preset time period, the data including the number of views, the number of likes, the number of shares, the number of comments, the number of detail views, and the number of orders placed.
- 3. The user preference analysis method according to claim 1, characterized in that preprocessing the interaction behavior data includes: determining whether missing data exist in the interaction behavior data and, if so, filling in the missing data; deleting the maximum and minimum values of a preset range in the interaction behavior data; and performing feature normalization on the interaction behavior data.
- 4. The user preference analysis method according to claim 1, characterized in that preprocessing the interaction behavior data further includes: adding a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
- 5. The user preference analysis method according to claim 1, characterized by further comprising: obtaining the user's physical product category preference data; and correcting the user's preference probability for the label according to the physical product category preference data.
- 6. The user preference analysis method according to claim 1, characterized by further comprising: selecting recommended content according to the preference probability; and obtaining the user's click data on the recommended content and correcting the preference probability according to the click data.
- 7. A user preference analysis device based on big data, characterized by comprising: a data acquisition module, configured to obtain interaction behavior data between a user and content, the content having at least one label; a feature preprocessing module, configured to preprocess the interaction behavior data and generate a feature data set, the feature data set serving as input feature values of a gcForest model; a cascade forest module, configured to use the class probability vector output by each cascade forest level in the gcForest model, together with the features of the feature data set, as input features of the next cascade forest level; and a preference calculation module, configured to obtain the user's preference probability for the label from the class probability vector output by the last cascade forest level of the gcForest model.
- 8. The user preference analysis device according to claim 7, characterized in that the interaction behavior data include data on the user's operations on the content within a preset time period, the data including the number of views, the number of likes, the number of shares, the number of comments, the number of detail views, and the number of orders placed.
- 9. The user preference analysis device according to claim 7, characterized in that the feature preprocessing module includes: a missing value processing unit, configured to determine whether missing data exist in the interaction behavior data and, if so, to fill in the missing data; an outlier processing unit, configured to delete the maximum and minimum values of a preset range in the interaction behavior data; and a normalization unit, configured to perform feature normalization on the interaction behavior data.
- 10. The user preference analysis device according to claim 7, characterized in that the feature preprocessing module further includes: a feature adding unit, configured to add a column of feature values according to the interaction behavior data and the user's operations on the content on the day before the current time.
- 11. The user preference analysis device according to claim 7, characterized by further comprising: a physical product preference correction module, configured to obtain the user's physical product category preference data and to correct the user's preference probability for the label according to the physical product category preference data.
- 12. The user preference analysis device according to claim 7, characterized by further comprising: a click-through rate correction module, configured to select recommended content according to the preference probability, obtain the user's click data on the recommended content, and correct the preference probability according to the click data.
- 13. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method steps of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710786530.7A CN107590224B (en) | 2017-09-04 | 2017-09-04 | Big data based user preference analysis method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107590224A true CN107590224A (en) | 2018-01-16 |
CN107590224B CN107590224B (en) | 2021-11-30 |
Family
ID=61051872
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710786530.7A Active CN107590224B (en) | 2017-09-04 | 2017-09-04 | Big data based user preference analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107590224B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102075851A (en) * | 2009-11-20 | 2011-05-25 | 北京邮电大学 | Method and system for acquiring user preference in mobile network |
CN105868847A (en) * | 2016-03-24 | 2016-08-17 | 车智互联(北京)科技有限公司 | Shopping behavior prediction method and device |
Non-Patent Citations (2)
Title |
---|
UTKIN, LEV V.,RYABININ, MIKHAIL A: "《Knowledge-Based Systems》", 27 April 2017 * |
ZHI-HUA ZHOU, JI FENG: "Deep Forest: Towards An Alternative to Deep Neural Networks", 《TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416623A (en) * | 2018-02-27 | 2018-08-17 | 苏州竹语网络科技有限公司 | Information recommendation method and device |
CN110490625A (en) * | 2018-05-11 | 2019-11-22 | 北京京东尚科信息技术有限公司 | User preference determines method and device, electronic equipment, storage medium |
CN109272534A (en) * | 2018-05-16 | 2019-01-25 | 西安电子科技大学 | SAR image change detection based on more granularities cascade forest model |
CN109272534B (en) * | 2018-05-16 | 2022-03-04 | 西安电子科技大学 | SAR image change detection method based on multi-granularity cascade forest model |
CN109785034A (en) * | 2018-11-13 | 2019-05-21 | 北京码牛科技有限公司 | User's portrait generation method, device, electronic equipment and computer-readable medium |
CN110490114A (en) * | 2019-08-13 | 2019-11-22 | 西北工业大学 | Target detection barrier-avoiding method in a kind of unmanned plane real-time empty based on depth random forest and laser radar |
CN110619585A (en) * | 2019-08-16 | 2019-12-27 | 广州越秀金融科技有限公司 | Method, device, storage medium and processor for recommending data |
CN112685618A (en) * | 2019-10-17 | 2021-04-20 | 中国移动通信集团浙江有限公司 | User feature identification method and device, computing equipment and computer storage medium |
CN111324733A (en) * | 2020-02-07 | 2020-06-23 | 北京创鑫旅程网络技术有限公司 | Content recommendation method, device, equipment and storage medium |
CN111652278A (en) * | 2020-04-30 | 2020-09-11 | 中国平安财产保险股份有限公司 | User behavior detection method and device, electronic equipment and medium |
CN111652278B (en) * | 2020-04-30 | 2024-04-30 | 中国平安财产保险股份有限公司 | User behavior detection method, device, electronic equipment and medium |
CN114048392A (en) * | 2022-01-13 | 2022-02-15 | 北京达佳互联信息技术有限公司 | Multimedia resource pushing method and device, electronic equipment and storage medium |
CN114169523A (en) * | 2022-02-10 | 2022-03-11 | 一道新能源科技(衢州)有限公司 | Solar cell use data analysis method and system |
CN116451087A (en) * | 2022-12-20 | 2023-07-18 | 石家庄七彩联创光电科技有限公司 | Character matching method, device, terminal and storage medium |
CN116451087B (en) * | 2022-12-20 | 2023-12-26 | 石家庄七彩联创光电科技有限公司 | Character matching method, device, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107590224B (en) | 2021-11-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||