CN108629633A

CN108629633A - Method and system for building user portraits based on big data

Info

Publication number: CN108629633A
Application number: CN201810438144.3A
Authority: CN
Inventors: 张铁舰; 付安龙
Original assignee: Inspur Software Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2018-05-09
Filing date: 2018-05-09
Publication date: 2018-10-09

Abstract

The invention discloses a method and system for building user portraits based on big data, belonging to the technical field of big data applications. The method comprises the following steps: S1: build the user portrait label system; S2: preprocess the data; S3: label samples automatically; S4: handle imbalance in the user data samples; S5: feature engineering; S6: model training, using a combination of multi-class classification models and binary classification models; S7: model optimization. The method and system improve the accuracy of user portraits, which can then be used to build personalized intelligent recommendation systems and to support precision marketing and targeted advertising, and therefore have good application value.

Description

A method and system for building user portraits based on big data

Technical field

The present invention relates to the technical field of big data applications, and specifically provides a method and system for building user portraits based on big data.

Background art

With the arrival of the big data era, more and more user data is being integrated and the volume of information keeps growing. How to make effective use of the accumulated data to obtain more accurate and more valuable information, present that information through effective labeling, and thereby build accurate user portraits is a problem currently facing the big data field. In the prior art, user portraits are mostly built by manual labeling, which is labor-intensive; the labels assigned depend heavily on the subjective preferences of the person doing the labeling, so label accuracy is a concern.

Summary of the invention

The technical task of the present invention is to address the above problems by providing a method for building user portraits based on big data that improves user portrait accuracy, so that personalized intelligent recommendation systems, precision marketing, and targeted advertising can be built on it.

A further technical task of the present invention is to provide a system for building user portraits based on big data.

To achieve the above objects, the present invention provides the following technical solutions:

A method for building user portraits based on big data, the method comprising the following steps:

S1: Build the user portrait label system

User data is normalized into an effective target label system. Labels are divided into structured and unstructured labels. Structured labels have clear hierarchical parent-child category relationships and are regular; unstructured labels have no hierarchical relationship and are scattered.

S2: Preprocess the data

S3: Automatic sample labeling

Samples are labeled automatically using semi-supervised learning.

S4: Handle imbalance in the user data samples

At the data level, samples are oversampled or undersampled; at the algorithm level, cost-sensitive learning and ensemble learning are applied.

S5: Feature engineering

Once the sample set has been built, features are extracted from the samples and classified according to their specific data types.

S6: Model training

A combination of multi-class classification models and binary classification models is used.

S7: Model optimization

Analyze whether the model is overfitting or underfitting, and optimize it accordingly.

A user portrait is a labeled user model abstracted from the user's data; in other words, user data is turned into labels by refining the data and attaching labels to it. After data preprocessing, samples are extracted and the user portrait labels are modeled. The results are continuously optimized by combining machine learning, deep learning (including deep reinforcement learning), transfer learning, natural language processing, and similar methods, which improves user portrait accuracy.

Model training requires training many models and then finding the best one. Take the classification of user interest data as an example: the number of categories is large and the categories have parent-child hierarchical relationships, so a single multi-class classification model cannot meet the requirement and a hierarchical model can be used instead. In this classification process, both the dependencies between the nodes of the category tree and the classification problem within each level must be considered. The model is built as a top-down hierarchical classification tree, which satisfies the dependencies between the nodes of the hierarchy. The classification problem within a level can be solved with either one multi-class classification model or several binary classification models. The structure of several binary classification models may be chosen for the following reasons: it makes categories easy to extend; several people can edit the categories; a single category can be extended and optimized without affecting the others; and, for overlapping categories, one sample can be assigned to several categories. When the categories do not cover the whole category space, care must be taken that out-of-domain samples are not misclassified. However, when there are many categories this structure brings more work: for n child categories it requires n-1 more models than a single multi-class classifier for that level. Therefore, depending on the actual business situation and the data, a structure combining multi-class classification models and binary classification models is generally adopted.

Preferably, the structured labels in step S1 include user attribute labels, short-term interest labels, and long-term interest labels.

Human psychology is complex and sometimes even contradictory, which makes human behavior correspondingly complex. The behavioral data generated is therefore complex and scattered, and this complex, scattered data needs to be normalized into an effective target label system.

User attribute labels include, for example, the user's name, gender, and height; their weight at the algorithm level is high.

Short-term interest labels include data such as browsing a bed in a product catalog. Once the purchase is made the user may never pay attention to the item again, so the weight at the algorithm model level decays rapidly over time.

Long-term interest labels include, for example, entertainment-crosstalk-Guo Degang special shows.

Unstructured labels are very numerous and widely dispersed, and are used as personalized labels.

Preferably, preprocessing the data in step S2 includes collecting user data and cleaning the collected data.

The user data includes user behavior data, such as browsing behavior; generated structured data, such as product libraries and libraries of browsed web pages; and knowledge data, such as category systems and data dictionaries. To obtain accurate data, the collected data is cleaned, which includes filtering out invalid data and noise data, anti-cheating, and handling unstructured data.

Preferably, the semi-supervised automatic sample labeling in step S3 trains on a small number of labeled samples, classifies a large number of unlabeled samples, and adds the samples with higher confidence to the training set.

The present invention uses Tri-training and CoForest. Tri-training divides the training data into 3 parts and trains 3 models; CoForest uses n classifiers and relies on a random forest to ensure diversity among the classifiers. Both Tri-training and CoForest can introduce an error-rate convergence condition to control noisy samples, and can reduce the number of noisy samples added through voting among multiple classifiers.
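The agreement-based labeling described above can be sketched as follows. This is a simplified illustration, not the patented implementation: the three-way split follows the description in the text, while the nearest-centroid classifier, the function names, and the single-round structure are assumptions made to keep the example self-contained. The error-rate convergence condition is omitted.

```python
from collections import Counter

def fit_centroids(samples):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))

def tri_training_round(labeled, unlabeled):
    """One simplified Tri-training round.

    The labeled data is split into 3 parts and one model is trained per part.
    An unlabeled sample is pseudo-labeled for model i only when the other
    two models agree on its label; agreement stands in for high confidence.
    """
    parts = [labeled[i::3] for i in range(3)]
    models = [fit_centroids(p) for p in parts]
    refined = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        extra = []
        for x in unlabeled:
            yj, yk = predict(models[j], x), predict(models[k], x)
            if yj == yk:
                extra.append((x, yj))
        refined.append(fit_centroids(parts[i] + extra))
    return refined

def vote(models, x):
    """Final label by majority vote of the three models."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]
```

In practice each part must contain every class, and the round would be repeated until no model changes; both points are left out of this sketch.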

Preferably, the oversampling in step S4 improves classification performance on the minority class by increasing the number of positive samples, and the undersampling removes negative samples; the cost-sensitive learning increases the weight of positive samples and reduces the weight of negative samples, and the ensemble learning divides the negative samples into several parts, each of which is trained together with the positive samples to obtain multiple models.

Simply duplicating positive samples is a form of oversampling. Its drawback is that it easily leads to overfitting, so Gaussian noise can be added randomly to the positive samples or new synthetic samples can be generated, for which the SMOTE algorithm can be used. Undersampling by randomly removing negative samples is simple to carry out; in practical classification, because negative samples are numerous, random undersampling at a certain ratio often works fairly well. Other sampling methods include Tomek links, NearMiss, and One-Sided Selection; the best one can be chosen by testing actual results.
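A minimal sketch of the data-level options described above, using only the standard library. The function names, the noise scale `sigma`, and the sampling ratio are illustrative assumptions. The `smote_like` function interpolates between two random positives, whereas SMOTE proper interpolates toward one of the k nearest neighbours.

```python
import random

def oversample_with_noise(positives, target, sigma=0.05, seed=0):
    """Grow the minority class to `target` samples by copying random
    positives and adding Gaussian noise to every coordinate."""
    rng = random.Random(seed)
    out = list(positives)
    while len(out) < target:
        base = rng.choice(positives)
        out.append(tuple(v + rng.gauss(0.0, sigma) for v in base))
    return out

def smote_like(positives, target, seed=0):
    """Create synthetic positives on the segment between two random positives."""
    rng = random.Random(seed)
    out = list(positives)
    while len(out) < target:
        a, b = rng.sample(positives, 2)
        t = rng.random()
        out.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return out

def undersample(negatives, ratio, n_pos, seed=0):
    """Randomly keep at most ratio * n_pos negative samples."""
    rng = random.Random(seed)
    keep = min(len(negatives), int(ratio * n_pos))
    return rng.sample(negatives, keep)
```

For example, with 10 positives and 100 negatives, `undersample(negatives, ratio=3, n_pos=10)` keeps 30 negatives, giving a 1:3 class ratio.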

Preferably, feature extraction in step S5 includes feature selection to find the feature subset used to train the model.

For text, words can be extracted as features, for example "China". For other text-type extended features, browsing data can be used to obtain the corresponding browsed URLs as features, or text similarity can be computed and similar items added as extended features. For domain-specific features, the domain attributes corresponding to text keywords are used, such as region, product type, and product attributes. For topic features, topic models such as TopicModel and LDA are used, and the distribution of the corresponding topic parameters serves as the feature.
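The text-similarity extension mentioned above could look like the following sketch. Cosine similarity over bag-of-words counts is one common choice and is an assumption here, since the text does not name a similarity measure; the function names and the `min_sim` cutoff are likewise hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def expand_with_similar(query_tokens, corpus, min_sim=0.5):
    """Return the ids of corpus texts similar enough to the query.

    corpus maps a document id to its token list; the matching ids can be
    attached to the sample as extended features.
    """
    return [doc_id for doc_id, tokens in corpus.items()
            if cosine_similarity(query_tokens, tokens) >= min_sim]
```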

Commonly used feature selection methods include:

TF-IDF: TF is term frequency, which measures how well a word describes the content of a document. It is the term count, normalized to prevent bias toward long documents. IDF is inverse document frequency, which measures how well a word distinguishes documents. The larger the TF-IDF value, the stronger the feature's ability to discriminate between categories.

Chi-square test: the larger the chi-square value, the more strongly the word discriminates between the two classes.

Mutual information: mutual information is generally used to measure the correlation between two words; in feature selection it can be used to compute how well a feature discriminates a category.

Information gain: information gain is the difference in information entropy before and after a feature is observed, and indicates the importance of the feature for classification.

When the number of training samples is large, the methods above produce similar results in practice. After the weights have been computed they are sorted, and features are generally selected in one of two ways as needed: select the K features with the largest weights, or select the features whose weight exceeds some threshold.
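As an illustration of two of the methods above and of the two selection modes, the following sketch computes TF-IDF weights and a chi-square score for a term against one class, then selects features by top-K or by threshold. The formulas are the textbook forms (TF as count divided by document length, IDF as log(N/df), chi-square from the 2x2 contingency table); the function names are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (docs are token lists)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    out = []
    for d in docs:
        counts = Counter(d)
        out.append({t: (c / len(d)) * math.log(n / df[t])
                    for t, c in counts.items()})
    return out

def chi_square(docs, labels, term, cls):
    """Chi-square statistic of `term` presence versus membership in `cls`."""
    a = b = c = d = 0
    for doc, y in zip(docs, labels):
        has = term in doc
        if has and y == cls:
            a += 1          # in class, contains term
        elif has:
            b += 1          # other class, contains term
        elif y == cls:
            c += 1          # in class, lacks term
        else:
            d += 1          # other class, lacks term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def top_k(weights, k):
    """Select the K features with the largest weights (ties broken by name)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

def above_threshold(weights, threshold):
    """Select every feature whose weight exceeds the threshold."""
    return sorted(t for t, w in weights.items() if w > threshold)
```

Top-K gives a fixed feature budget, while a threshold lets the number of features vary with the data; which is preferable depends on the downstream model.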

Preferably, for underfitting in step S7, where the accuracy on both the training set and the test set is low, data cleaning is performed, effective features are added, and a more complex model is substituted.

For underfitting, in addition to cleaning the data and adding effective features, the penalty coefficient of the regularization term can be reduced and model fusion with voting can be used. Substituting a more complex model means, for example, replacing a linear model with a nonlinear one.

Preferably, for overfitting in step S7, where the accuracy on the training set is high but the accuracy on the test set is low, more training sample data is added and a simpler model is substituted.

For overfitting, the penalty coefficient of the regularization term can also be increased and the number of training iterations reduced. Substituting a simpler model means, for example, replacing a nonlinear model with a linear one.

A system for building user portraits based on big data comprises a user portrait label system construction module, a data preprocessing module, an automatic sample labeling module, a sample imbalance processing module, a feature engineering module, a model training module, and a model optimization module. The user portrait label system construction module builds the user portrait label system; the data preprocessing module preprocesses the data; the automatic sample labeling module labels samples automatically; the sample imbalance processing module handles sample imbalance; the feature engineering module extracts features from the samples and classifies them according to their specific data types; the model training module trains the models; and the model optimization module optimizes the models.

Compared with the prior art, the method of the present invention for building user portraits based on big data has the following notable beneficial effects: the complete scheme it provides for building accurate user portraits makes it possible to build user portraits quickly, efficiently, and accurately, and improves the accuracy of data labels. Personalized intelligent recommendation systems, precision marketing, and targeted advertising can be built on this basis, user data is fully exploited, and the method has good application value.

Detailed description

The method and system of the present invention for building user portraits based on big data are described in further detail below with reference to an embodiment.

Embodiment

The method of the present invention for building user portraits based on big data includes the following steps:

S1: Build the user portrait label system

User data is normalized into an effective target label system, and labels are divided into structured and unstructured labels. Structured labels have clear hierarchical parent-child category relationships, are regular, and can be represented as a tree or a forest. Depending on the actual situation, they are subdivided into user attribute labels, short-term interest labels, and long-term interest labels. User attribute labels include, for example, the user's name, gender, and height, and carry a high weight at the algorithm level. Short-term interest labels include data such as browsing a bed in a product catalog; once the purchase is made the user may never pay attention to the item again, so the weight at the algorithm model level decays rapidly over time. Long-term interest labels include, for example, entertainment-crosstalk-Guo Degang special shows.

Unstructured labels have no hierarchical relationship; they are scattered, very numerous, and widely dispersed, and are used as personalized labels.
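One way to picture the structured label tree and the time-dependent weights is the sketch below. The `Label` class, the exponential half-life form of the decay, and the half-life values are all assumptions for illustration; the text only states that attribute labels weigh heavily and that short-term interest weights decay quickly.

```python
from dataclasses import dataclass, field

@dataclass
class Label:
    """A node in the structured (hierarchical) label tree."""
    name: str
    kind: str                      # "attribute", "short_term" or "long_term"
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield every root-to-leaf path as a tuple of label names."""
        here = prefix + (self.name,)
        if not self.children:
            yield here
        for child in self.children:
            yield from child.paths(here)

# Hypothetical half-lives in days; None means the weight never decays.
HALF_LIFE_DAYS = {"attribute": None, "short_term": 7.0, "long_term": 365.0}

def decayed_weight(base, kind, age_days):
    """Exponentially decay a label weight according to its kind and age."""
    half_life = HALF_LIFE_DAYS[kind]
    if half_life is None:
        return base
    return base * 0.5 ** (age_days / half_life)
```

With these assumed values, a short-term interest observed a week ago keeps half its weight, while a long-term interest of the same age is almost unchanged.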

S2: Preprocess the data

Preprocessing the data includes collecting user data and cleaning the collected data. User data includes user behavior data, such as browsing behavior; generated structured data, such as product libraries and libraries of browsed web pages; and knowledge data, such as category systems and data dictionaries. To obtain accurate data, the collected data is cleaned, which includes filtering out invalid data and noise data, anti-cheating, and handling unstructured data.
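The cleaning pass could be sketched as below. The event schema (`user`, `item`, `ts`), the treatment of exact duplicates as noise, and the per-user event cap used as an anti-cheating heuristic are all hypothetical; the text does not specify concrete rules.

```python
from collections import Counter

def clean_events(events, max_events_per_user=1000):
    """Clean a list of behavior events (dicts with 'user', 'item', 'ts')."""
    # 1. Invalid data: drop records with a missing or empty required field.
    required = ("user", "item", "ts")
    valid = [e for e in events
             if all(e.get(k) not in (None, "") for k in required)]

    # 2. Noise: drop exact duplicates while preserving order.
    seen, deduped = set(), []
    for e in valid:
        key = (e["user"], e["item"], e["ts"])
        if key not in seen:
            seen.add(key)
            deduped.append(e)

    # 3. Anti-cheating: drop users with an implausibly high event count.
    per_user = Counter(e["user"] for e in deduped)
    return [e for e in deduped if per_user[e["user"]] <= max_events_per_user]
```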

S3: Automatic sample labeling

Labeling samples manually is a heavy workload, especially in a big data environment, and also suffers from drawbacks such as subjectivity and arbitrariness. The present invention labels samples automatically with semi-supervised learning: a model is trained on a small number of labeled samples and used to classify a large number of unlabeled samples, and the samples with higher confidence are added to the training set.

S4: Handle imbalance in the user data samples

In the sampled data, sample imbalance can occur, and the number of positive samples is typically far smaller than the number of negative samples. The present invention addresses this at two levels.

The first is the data level, where samples can be oversampled or undersampled. Oversampling improves classification performance on the minority class by increasing the number of positive samples. Simply duplicating positive samples is a form of oversampling, but it easily leads to overfitting, so Gaussian noise can be added randomly to the positive samples or new synthetic samples can be generated, for which the SMOTE algorithm can be used. The simplest undersampling method is to remove negative samples at random; the advantage is that the operation is simple, and in some practical classification tasks, because negative samples are so numerous, random undersampling at a certain ratio works fairly well.

The second is the algorithm level, where the main methods are cost-sensitive learning and ensemble learning. The weights of the penalty terms in the loss function can be adjusted, for example by increasing the weight of positive samples and reducing the weight of negative samples; this is cost sensitivity. In ensemble learning, for example, the negative samples are divided into several parts, each part is trained together with the positive samples to obtain multiple models, and the final result is obtained by voting.
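The two algorithm-level ideas can be sketched as follows. The weighted log loss, the default class weights, and the interleaved split of the negatives are assumptions chosen for illustration; any loss with per-class weights and any partition of the negatives would fit the description.

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos=5.0, w_neg=1.0, eps=1e-12):
    """Cost-sensitive log loss: errors on positives cost w_pos, on negatives w_neg."""
    total = weight_sum = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)          # keep log() finite
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

def split_negatives(positives, negatives, n_parts):
    """Build n_parts training sets: all positives plus one slice of negatives."""
    return [positives + negatives[i::n_parts] for i in range(n_parts)]

def majority_vote(predictions):
    """Combine 0/1 predictions from the individual models by strict majority."""
    return 1 if 2 * sum(predictions) > len(predictions) else 0
```

Each training set returned by `split_negatives` is far better balanced than the original data, and one model is trained per set before voting.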

S5: Feature engineering

Once the sample set has been built, features are extracted from the samples and classified according to their specific data types. For text, words can be extracted as features, for example "China". For other text-type extended features, browsing data can be used to obtain the corresponding browsed URLs as features, or text similarity can be computed and similar items added as extended features. For domain-specific features, the domain attributes corresponding to text keywords are used, such as region, product type, and product attributes. For topic features, topic models such as TopicModel and LDA are used, and the distribution of the corresponding topic parameters serves as the feature.

Feature extraction also requires feature selection to find the feature subset used to train the model. Commonly used feature selection methods include TF-IDF, the chi-square test, mutual information, and information gain.

S6: Model training

Model training requires training many models and then finding the best one. Take the classification of user interest data as an example: the number of categories is large and the categories have parent-child hierarchical relationships, so a single multi-class classification model cannot meet the requirement and a hierarchical model can be used instead. In this classification process, both the dependencies between the nodes of the category tree and the classification problem within each level must be considered. The model is built as a top-down hierarchical classification tree, which satisfies the dependencies between the nodes of the hierarchy. The classification problem within a level can be solved with either one multi-class classification model or several binary classification models. The structure of several binary classification models may be chosen for the following reasons: it makes categories easy to extend; several people can edit the categories; a single category can be extended and optimized without affecting the others; and, for overlapping categories, one sample can be assigned to several categories. When the categories do not cover the whole category space, care must be taken that out-of-domain samples are not misclassified. However, when there are many categories this structure brings more work: for n child categories it requires n-1 more models than a single multi-class classifier for that level. Therefore, depending on the actual business situation and the data, a structure combining multi-class classification models and binary classification models is generally adopted.
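The top-down hierarchy with one binary classifier per category can be sketched as below. The `Node` class, the `classify` walk, and the keyword-based `keyword_clf` stand-in for a trained binary model are all hypothetical; the sketch only illustrates how a sample can land in several categories and how an out-of-domain sample stays at the root instead of being forced into a category.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """A category with its own binary classifier and child categories."""
    name: str
    accepts: Callable[[set], bool]
    children: List["Node"] = field(default_factory=list)

def classify(node, tokens, path=()):
    """Walk the tree top-down, descending only into children that accept.

    Returns every matched path, so one sample may belong to several
    categories. If no child accepts, the path stops at the current node.
    """
    here = path + (node.name,)
    hits = [c for c in node.children if c.accepts(tokens)]
    if not hits:
        return [here]
    results = []
    for child in hits:
        results.extend(classify(child, tokens, here))
    return results

def keyword_clf(*words):
    """Hypothetical stand-in for a trained binary model: fires on any keyword."""
    vocab = set(words)
    return lambda tokens: bool(vocab & tokens)
```

Adding a new category only means attaching one more `Node` with its own classifier; the existing nodes are untouched, which is the extensibility argument made above.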

S7: Model optimization

After the model has been trained, it is necessary to check whether its actual performance meets the expected requirements. If it does not, the model must be optimized: analyze whether it is overfitting or underfitting and optimize it in a targeted way.

For underfitting, the accuracy on both the training set and the test set is low, and the trained model has not yet learned the underlying relationships well. The training samples should be optimized: they may still contain noisy samples that have not been cleaned out, so data cleaning must continue. In addition, effective features are added, the penalty coefficient of the regularization term is lowered, a relatively more complex model is substituted (for example, a linear model is replaced with a nonlinear one), and model fusion with voting is used.

For overfitting, the accuracy on the training set is high during training but the accuracy on the test set is low. This is addressed by adding training sample data, raising the penalty coefficient of the regularization term, reducing the number of training iterations, and substituting a relatively simpler model, for example replacing a nonlinear model with a linear one.
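The diagnosis described above amounts to a simple rule on the two accuracies. The sketch below encodes it; the `target` accuracy and the `max_gap` between training and test accuracy are illustrative thresholds that the text does not specify, and the suggested actions are taken from the paragraphs above.

```python
def diagnose(train_acc, test_acc, target=0.85, max_gap=0.10):
    """Classify the fit from training and test accuracy and suggest remedies."""
    if train_acc < target and test_acc < target:
        return "underfitting", [
            "continue data cleaning",
            "add effective features",
            "lower the regularization penalty coefficient",
            "switch to a more complex (e.g. nonlinear) model",
            "use model fusion with voting",
        ]
    if train_acc - test_acc > max_gap:
        return "overfitting", [
            "add training sample data",
            "raise the regularization penalty coefficient",
            "reduce the number of training iterations",
            "switch to a simpler (e.g. linear) model",
        ]
    return "ok", []
```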

The embodiment described above is only a preferred specific implementation of the present invention; ordinary variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall all fall within the scope of protection of the present invention.

Claims

1. A method for building user portraits based on big data, characterized in that the method comprises the following steps:

S1: Build the user portrait label system

S2: Preprocess the data

S3: Automatic sample labeling

Samples are labeled automatically using semi-supervised learning.

S4: Handle imbalance in the user data samples

S5: Feature engineering

S6: Model training

A combination of multi-class classification models and binary classification models is used.

S7: Model optimization

Analyze whether the model is overfitting or underfitting, and optimize the model.

2. The method for building user portraits based on big data according to claim 1, characterized in that the structured labels in step S1 include user attribute labels, short-term interest labels, and long-term interest labels.

3. The method for building user portraits based on big data according to claim 1 or 2, characterized in that preprocessing the data in step S2 includes collecting user data and cleaning the collected data.

4. The method for building user portraits based on big data according to claim 3, characterized in that the semi-supervised automatic sample labeling in step S3 trains on a small number of labeled samples, classifies a large number of unlabeled samples, and adds the samples with higher confidence to the training set.

5. The method for building user portraits based on big data according to claim 4, characterized in that the oversampling in step S4 improves classification performance on the minority class by increasing the number of positive samples, and the undersampling removes negative samples; the cost-sensitive learning increases the weight of positive samples and reduces the weight of negative samples, and the ensemble learning divides the negative samples into several parts, each of which is trained together with the positive samples to obtain multiple models.

6. The method for building user portraits based on big data according to claim 5, characterized in that feature extraction in step S5 includes feature selection to find the feature subset used to train the model.

7. The method for building user portraits based on big data according to claim 6, characterized in that, for underfitting in step S7, where the accuracy on both the training set and the test set is low, data cleaning is performed, effective features are added, and a more complex model is substituted.

8. The method for building user portraits based on big data according to claim 7, characterized in that, for overfitting in step S7, where the accuracy on the training set is high and the accuracy on the test set is low, training sample data is added and a simpler model is substituted.

9. A system for building user portraits based on big data, characterized in that it comprises a user portrait label system construction module, a data preprocessing module, an automatic sample labeling module, a sample imbalance processing module, a feature engineering module, a model training module, and a model optimization module; the user portrait label system construction module builds the user portrait label system, the data preprocessing module preprocesses the data, the automatic sample labeling module labels samples automatically, the sample imbalance processing module handles sample imbalance, the feature engineering module extracts features from the samples and classifies them according to their specific data types, the model training module trains the models, and the model optimization module optimizes the models.