CN114358807A - User portrayal method and system based on predictable user characteristic attributes - Google Patents

User portrayal method and system based on predictable user characteristic attributes

Info

Publication number
CN114358807A
Authority
CN
China
Prior art keywords
user
gender
characteristic
age group
age
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110327165.XA
Other languages
Chinese (zh)
Inventor
李永安
赵世亭
陈洪涛
邹建伟
何成俭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xunze Network Technology Co ltd
Original Assignee
Shanghai Xunze Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xunze Network Technology Co ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a user portrait method and system based on predictable user characteristic attributes. The method comprises the following steps. S1: construct a feature database of internet user terminals and acquire the application programs installed on each user terminal. S2: acquire a plurality of sample data and label them with the age group feature and the gender feature respectively. S3: perform one-hot encoding on the application program list in the labeled data, based on the age group and the gender, to obtain an age-group-based feature matrix and a gender-based feature matrix. S4: train on the age group and gender feature matrices with the XGBoost algorithm to obtain age group and gender prediction models. S5: use the age group and gender prediction models to predict the age group and gender of the user, and construct the user portrait based on the characteristic attributes of the user terminal. The invention predicts the gender and age of a user from the application programs installed on the user terminal and constructs the user portrait in combination with the known labels of the user terminal.

Description

User portrayal method and system based on predictable user characteristic attributes
Technical Field
The invention relates to the technical field of computer processing, in particular to a user portrait method and a user portrait system based on predictable user characteristic attributes.
Background
A user portrait is an effective tool for sketching target users and connecting user demands with design direction, and it is widely applied in many fields. Constructing a user portrait is essentially describing a virtual user group in short text (optionally with pictures): user features are abstracted into phrase tags, and users sharing the same tags can be regarded as having similar goals, needs, and behaviors. Two kinds of construction process exist on the market: in one, product designers and operators abstract typical users from the user group according to user needs; in the other, a tag set describing each user is generated from data such as the user's behaviors and opinions in products and services. Specifically, a user portrait serves as a set of tags depicting user features, such as the static attributes of age and gender, and may also include interest features such as travel and clothing. Constructing and updating user portraits is important for the subsequent targeted dissemination of information, such as targeted advertisement placement.
User portraits on the market are generally expressed as feature tags generated from user behaviors such as browsing, purchasing, and usage. That is, behavior logs of the user within a site, such as the media categories and commodity categories accessed, are stored; all user behavior logs within a certain time window are then traversed and a weight decay function is computed to obtain the user feature tags. The feature tags include user preference tags and user basic attribute tags. The preference tags are obtained through statistical analysis and similar means based on content delivery analysis on the intelligent terminal, while most basic attribute tags (gender, age, place of birth, and the like) and their data sources are predicted and estimated, yielding the latest user portrait.
The inventors of the present application found that the data obtained by this approach is very one-sided: for example, the user data held by one website says nothing about the user's visits to other websites. Meanwhile, real-name registration is difficult to achieve in the internet industry, so static attributes such as age and gender can only be guessed through statistical algorithms, and the reliability is low. The user data of the data holders of existing application programs is essentially closed, forming data islands that cannot be connected and shared across the whole network. As a result, data users cannot accurately understand users' behavior preferences across the whole network or perform personalized data mining that fits their industry requirements, which makes it difficult to generate user portraits suited to industry applications.
Disclosure of Invention
The user portrait method and system based on predictable user characteristic attributes solve the prior-art problems of one-sided user access data and closed data between application programs. They achieve the technical effect of predicting a user's gender and age from the application programs installed on the user terminal and then constructing the user portrait from the known labels of the user terminal, so that accurate recommendation can be implemented for each user terminal.
In a first aspect, an embodiment of the present application provides a user portrait method based on predictable user characteristic attributes, where the method includes:
S1: constructing a feature database of internet user terminals, acquiring the application programs installed on each user terminal, assigning different user attribute weights to different application programs, refining the categories of the user characteristic attributes (including gender and age) reported in each application program, and then automatically merging them according to preset labels to unify the categories;
S2: acquiring a plurality of sample data from the user feature database, and labeling the sample data with the age group feature and the gender feature respectively;
S3: based on the age group feature, performing one-hot encoding on the application program list in the labeled data to obtain an age-group-based feature matrix; based on the gender feature, performing one-hot encoding on the application program list in the labeled data to obtain a gender-based feature matrix; taking the age-group-based feature matrix and/or the gender-based feature matrix as the feature input of the subsequent XGBoost algorithm, thereby forming a certain amount of sample data for predicting the characteristic attributes of user terminals;
S4: training a model on the obtained sample data with the XGBoost algorithm, which improves applicability; the age group feature matrix and the gender feature matrix are trained with the XGBoost algorithm, and the age group and gender prediction models are obtained once training has matured;
S5: according to the age group prediction model and the gender prediction model, predicting the age group and gender of the user from the application programs installed on any user terminal; meanwhile, forming multi-dimensional interest tags for the user based on user characteristic attributes including one or more of region, city grade, terminal type, and consumption level, and constructing the user portrait so that accurate recommendations can be made to the user.
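As an illustration only, step S5 can be sketched in Python as follows. The function `build_portrait`, the dictionary keys, and the two stand-in predictor callables are hypothetical names introduced for this sketch; they are not defined in the application, and the trained models would take the place of the lambdas.

```python
from typing import Callable, Dict, List


def build_portrait(
    installed_apps: List[str],
    terminal_attrs: Dict[str, str],
    predict_age_group: Callable[[List[str]], str],
    predict_gender: Callable[[List[str]], str],
) -> Dict[str, str]:
    """Combine the predicted age group and gender with known terminal attributes."""
    portrait = {
        "age_group": predict_age_group(installed_apps),
        "gender": predict_gender(installed_apps),
    }
    # Copy only the attribute kinds named in step S5 (key names are assumptions).
    for key in ("region", "city_grade", "terminal_type", "consumption_level"):
        if key in terminal_attrs:
            portrait[key] = terminal_attrs[key]
    return portrait


# Stand-in predictors, used only to make the sketch runnable.
portrait = build_portrait(
    ["APP1", "APP2", "APP3"],
    {"region": "East China", "terminal_type": "smartphone"},
    predict_age_group=lambda apps: "AGE3",
    predict_gender=lambda apps: "female",
)
```

Passing the predictors as callables keeps the portrait assembly independent of how the age group and gender models are trained.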
Further, in step S3, the method for obtaining the age-group-based feature matrix includes:
S31: performing one-hot encoding on the application programs of each user terminal, coding an installed application program as 1 and an application program that is not installed as 0, to obtain feature matrix (1) of the application programs installed on all user terminals in the labeled data:
[Matrix (1): equation image not available in the extracted text]
S32: substituting the age group feature into matrix (1) to obtain feature matrix (2), expressed as follows:
[Matrix (2): equation image not available in the extracted text]
S33: substituting the age-group-based user attribute weight of each application program into matrix (2) to obtain the age-group-based feature matrix (3), expressed as follows:
[Matrix (3): equation image not available in the extracted text]
Further, in step S3, the method for obtaining the gender-based feature matrix includes:
S34: substituting the gender feature into matrix (1) to obtain feature matrix (4), expressed as follows:
[Matrix (4): equation image not available in the extracted text]
S35: substituting the gender-based user attribute weight of each application program into matrix (4) to obtain the gender-based feature matrix (5), expressed as follows:
[Matrix (5): equation image not available in the extracted text]
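Because the matrix images are not available in the extracted text, the exact layout of matrices (1) to (5) cannot be confirmed. The following Python sketch shows one plausible reading under stated assumptions: matrix (1) has one row per user terminal and one column per application program, and the weighted matrix scales each installed entry by that application program's attribute weight. The function names and sample values are hypothetical.

```python
from typing import Dict, List


def one_hot_matrix(
    terminal_apps: Dict[str, List[str]], vocabulary: List[str]
) -> List[List[int]]:
    """One row per terminal, one column per application: 1 if installed, else 0."""
    rows = []
    for terminal_id in sorted(terminal_apps):
        installed = set(terminal_apps[terminal_id])
        rows.append([1 if app in installed else 0 for app in vocabulary])
    return rows


def apply_weights(
    matrix: List[List[int]], vocabulary: List[str], app_weight: Dict[str, float]
) -> List[List[float]]:
    """Scale each entry by the application's attribute weight (assumed interpretation)."""
    return [
        [cell * app_weight.get(app, 0.0) for cell, app in zip(row, vocabulary)]
        for row in matrix
    ]


vocabulary = ["APP1", "APP2", "APP3"]
terminal_apps = {"t1": ["APP1", "APP3"], "t2": ["APP2"]}
m1 = one_hot_matrix(terminal_apps, vocabulary)

# Hypothetical share of male users per application, used as the gender weight.
male_share = {"APP1": 0.56, "APP2": 0.30, "APP3": 0.80}
weighted = apply_weights(m1, vocabulary, male_share)
```

The label column (age group or gender) would be kept alongside these rows as the training target; how the application places it inside matrices (2) and (4) is not recoverable from the text.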
Further, in step S4, the method for training the age group feature matrix and the gender feature matrix with the XGBoost algorithm to obtain the age group and gender prediction models includes:
S41: receiving a labeled sample data set containing n user terminals, each with m feature values, expressed as: $D = \{(x_i, y_i)\}\ (|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$;
S42: performing XGBoost model training on the age group feature matrix and/or the gender feature matrix. K trees are trained, yielding K CART decision trees $f_k(x) = \omega_{q(x)}$; these are summed into an ensemble model that serves as the age group or gender prediction model, whose output is the predicted age group or gender. The prediction model is expressed as:
$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \Gamma$$
where $\hat{y}_i$ is the output of the XGBoost model and $\Gamma$ is the regression tree space, that is, the set of CART decision trees, defined as: $\Gamma = \{ f(x) = \omega_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$;
Here $f(x)$ denotes a CART decision tree consisting of a tree structure $q$ and $T$ leaf nodes. $q$ maps each sample to the index of the corresponding leaf node, $x$ denotes a user terminal, and $q(x)$ is the leaf node in which that user terminal falls. Each leaf node carries a continuous value called the leaf weight $\omega$, and $\omega_{q(x)}$ is the value that the regression tree outputs for the sample, that is, the predicted value for that terminal user. Each $f_k$ corresponds to an independent tree structure $q$ and its leaf weights, and the weights of all leaves of a tree form the weight vector $\omega \in \mathbb{R}^T$.
The tree structure $q$ splits on the characteristic attributes of the user terminal and maps any sample with m-dimensional characteristic attributes to one of the leaf nodes. For one sample, the XGBoost model obtains the final predicted value $\hat{y}_i$ as follows: the sample is mapped to its leaf node on each decision tree, and the weights of the K leaf nodes corresponding to the sample are then summed;
The XGBoost algorithm adopts the following objective function:
$$L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$
The objective function measures the deviation between the predicted values and the true values of the model, and training makes the objective value as small as possible. $\hat{y}_i$ denotes the predicted value and $y_i$ the target value; $l$ is the training loss function measuring the deviation between $\hat{y}_i$ and $y_i$; $\Omega$ is the regularization term that controls the complexity of the trained model, defined as:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|\omega\|^2$$
$\Omega$ is the model complexity penalty term, $\gamma$ and $\lambda$ are penalty coefficients, and $T$ is the number of leaf nodes; $\|\omega\|^2$ is the sum of squares of the output scores on the leaf nodes of each tree, which is equivalent to L2 regularization.
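The prediction and objective described above can be illustrated with a small pure-Python sketch. It is not the XGBoost library and does not train trees; it only evaluates a fixed ensemble. The squared-error loss is an assumed choice for l, and the two toy trees, their leaf weights, and the coefficient values are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A tree is a pair (q, omega): q maps a feature vector to a leaf index,
# and omega holds one weight per leaf.
Tree = Tuple[Callable[[Sequence[float]], int], List[float]]


def predict(trees: List[Tree], x: Sequence[float]) -> float:
    """Sum, over all trees, the weight of the leaf that x falls into."""
    return sum(omega[q(x)] for q, omega in trees)


def regularizer(omega: List[float], gamma: float, lam: float) -> float:
    """gamma * T + 0.5 * lambda * ||omega||^2, where T is the number of leaves."""
    return gamma * len(omega) + 0.5 * lam * sum(w * w for w in omega)


def objective(trees: List[Tree], xs, ys, gamma: float, lam: float) -> float:
    """Squared-error loss (assumed choice of l) plus the penalty of every tree."""
    loss = sum((predict(trees, x) - y) ** 2 for x, y in zip(xs, ys))
    return loss + sum(regularizer(omega, gamma, lam) for _, omega in trees)


# Two toy trees, each splitting on one one-hot feature (hypothetical values).
trees = [
    (lambda x: 0 if x[0] < 0.5 else 1, [-1.0, 1.0]),
    (lambda x: 0 if x[1] < 0.5 else 1, [0.5, 2.0]),
]
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, 1.0]
y_hat = [predict(trees, x) for x in xs]
total = objective(trees, xs, ys, gamma=1.0, lam=2.0)
```

In this toy case the predictions are 1.5 and 1.0, the loss is 0.25, and the two penalty terms are 4.0 and 6.25, giving an objective of 10.5.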
Further, before step S3, after the labeled data is received, the top N application programs covering the most users are screened out based on the application program lists in the labeled data, where N is greater than or equal to 3000, and the user terminals on which 5 or more application programs are installed are recorded as the user terminals for which one-hot encoding can be performed.
In a second aspect, an embodiment of the present application provides a system for the user portrait method based on predictable user characteristic attributes of any one of the first aspect, including:
a database module configured to construct a feature database of internet user terminals, acquire the application programs installed on each user terminal, assign different user attribute weights to different application programs, refine the categories of the user characteristic attributes (including gender and age) reported in each application program, and then automatically merge them according to labels to unify the categories;
a preprocessing module configured to acquire a plurality of sample data from the user feature database and label the sample data with the age group feature and the gender feature respectively;
a feature encoding module configured to perform one-hot encoding on the application program list in the labeled data based on the age group feature to obtain an age-group-based feature matrix; to perform one-hot encoding on the application program list in the labeled data based on the gender feature to obtain a gender-based feature matrix; and to take the age-group-based feature matrix and/or the gender-based feature matrix as the feature input of the subsequent XGBoost algorithm, thereby forming a certain amount of sample data for predicting the characteristic attributes of user terminals;
a model training module configured to train a model on the obtained sample data with the XGBoost algorithm, which improves applicability; the age group feature matrix and the gender feature matrix are input into a preset XGBoost model, and the age group and gender prediction models are obtained once training has matured;
and a portrait generation module configured to predict the age group and gender of the user from the application programs installed on any user terminal according to the age group prediction model and the gender prediction model, to form multi-dimensional interest tags for the user based on user characteristic attributes including one or more of region, city grade, terminal type, and consumption level, and to construct the user portrait so that accurate recommendations can be made to the user.
In a third aspect, an embodiment of the present application provides an electronic device, including:
one or more memories for storing executable program code and one or more programs;
a processor coupled to the memory;
the processor reads the executable program code and runs the computer program corresponding to the executable program code to perform the user portrait method based on predictable user characteristic attributes according to any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides a storage medium storing executable program code, where at least one processor reads the executable program code and runs the corresponding computer program to perform the user portrait method based on predictable user characteristic attributes according to any one of the first aspect.
The technical solutions provided in the embodiments of the present application have at least the following technical effects:
1. The method and system analyze the user characteristic attributes, including age group and gender, in combination with user features such as region, city grade, device model, and user behavior (for example, consumption level) to generate the user portrait, providing a good data basis for precision marketing, personalized service, online public opinion management, and other applications.
2. The invention applies the boosting-tree idea of the XGBoost model to the prediction of age group and gender, which improves prediction accuracy and makes the generated user portrait more accurate.
3. Based on the user terminals in the internet, the user terminals and their application programs are obtained through the networking behavior of each user terminal, thereby obtaining the application program installation list reported by each user terminal together with its region, city grade, device model, user consumption level, and the like; the top N APPs covering the most users are obtained, the age group, gender, and the like are labeled, and data preprocessing is performed.
4. The invention trains the XGBoost algorithm on the APP lists of users whose age group and gender have been labeled to obtain an age group prediction model and a gender prediction model. When related data is later obtained with the age and gender unknown, the age group and gender can be predicted with these models, providing a good basis for an accurate user portrait.
5. The predicted age and gender are combined with the related information provided by the user for user analysis, generating multiple labels for the user and constructing the user portraits required by different scenarios.
Drawings
FIG. 1 is a flowchart illustrating a user portrait method based on predictable user characteristic attributes according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the same user installing application programs developed by the same software company according to an embodiment of the present application;
FIG. 3 is a flowchart of a user portrait system based on predictable user characteristic attributes according to a second embodiment of the present application.
Detailed Description
To further supplement the background art: a user portrait can be constructed based on different mechanisms.
User portrait construction based on design and thinking: through questionnaires, interviews, and similar means, the commonalities and differences among users are understood, and different user portraits are analyzed and designed. Given the inherent specificity and complexity of this approach, research usually proposes concrete construction processes from four perspectives: goal-oriented, role-oriented, participation-oriented, and fiction-oriented. The first three share the feature that the construction process is based on data, whereas the fiction-oriented perspective is not based on data and relies only on the intuition and assumptions of designers. The goal-oriented perspective can focus design ideas and serve as a communication tool for discussion, but it underestimates the value of user participation.
User portrait construction based on ontology or concept: the ontology-based approach draws on existing documents, application programs, ontologies, and the like related to user context information and profile information. One study of user portraits in an academic paper recommendation system represents each user as a subject-based ontology and converts articles into word vectors and values for matching. The concept-based approach builds a concept-based user portrait from search engine logs and click-through data using a learning-to-rank method, in which the concept vectors take both positive and negative values. The limitation is that a separate ontology structure must be built for each field with expert participation, which is time-consuming, laborious, and costly.
User portrait construction based on subject or topic: the topics of a user are classified with algorithms such as naive Bayes and LDA to represent the user. The topic-based approach uses the related texts and topic feature information of users: people sharing an interest topic are grouped together to obtain information about that type of person, and group tags for that type are extracted algorithmically. However, the characteristics of the users themselves are not really introduced into the model, and deviations may arise when analyzing positive and negative sentiment.
Construction based on interest or preference: the user is depicted from the information the user frequently browses or follows, making full use of the similar interest preferences of different users. Among the user portrait construction methods targeting user interest, one proposes two kinds of information, the description of the goods a user likes and the duration of the user's interaction with specific goods. Another dynamically adjusts the user's preferences according to search behavior, represents the preference profile with weighted vocabularies, and tests the effectiveness of the dynamic adjustment in a search system. However, such methods cannot handle the dynamic and real-time nature of user interests well; the interests and preferences of the user can only be estimated from limited historical behavior.
Construction based on multi-dimension or fusion: the user is depicted along multiple dimensions using data of multiple feature types. One line of work builds the user portrait from text features and social features and studies age classification. For the gender inference problem on some platforms, a new name classifier has been proposed that, rather than relying solely on a database of names, extracts features from user names and uses them to assign a gender. Other researchers propose a multi-view fusion framework that uses two channels to model and predict users with different characteristics, and propose several feature extraction methods for different data types to build a multi-source feature system for users (including category features, clustering features, numerical features, and the like); however, the computation is heavy and complex.
In view of the shortcomings of the above user portrait construction methods, the present application provides a user portrait method based on predictable user characteristic attributes. For a better understanding, the technical solutions are described in detail below with reference to the drawings and specific embodiments.
Example one
Referring to FIG. 1, an embodiment of the present application provides a user portrait method based on predictable user characteristic attributes, which specifically includes the following steps:
Step S1: constructing a feature database of internet user terminals, acquiring the application programs installed on each user terminal, assigning different user attribute weights to different application programs, refining the categories of the user characteristic attributes (including gender and age) reported in each application program, and then automatically merging them according to preset labels to unify the categories.
Referring to FIG. 2, regarding how the application programs on each terminal are obtained: this embodiment relies on multiple application programs developed by one software company, since the same user may install several of that company's application programs. The software company can therefore be understood to hold separate databases for application content and for user data, such as {application 1 database, application 2 database, application 3 database, …, user data}. Further, when a user installs an application program, a pop-up requests user authorization; by confirming, the user allows the installed application program to acquire certain information about the terminal. For example, after a user downloads and installs the chat and friend-making social application "light and sweet", a pop-up displays the user agreement, privacy terms, and permission descriptions, and once the user grants permission, the software company collects the required information. This embodiment uses the software company's user database, that is, it integrates the user information held by the servers of the individual application programs, which facilitates advertisement placement with the technology of this embodiment. After users are classified, high-value users can be screened out, which facilitates optimizing advertisement placement.
In this embodiment, based on analysis of massive user terminal data in open-source big data, the gender and age group of the user to whom a user terminal belongs are predicted from the application programs installed on that terminal, and accurate recommendations are made to the user on the terminal in combination with the terminal's other related attributes. The other related attributes may be, but are not limited to, region, city grade, terminal type, and consumption level. They can be understood as further user characteristic attributes determined from the user terminal ID after the gender and age group have been predicted from the registered application programs on the terminal, or as user characteristic attributes obtained from unregistered application programs for a known user terminal ID. Each user terminal has a data structure of the form {user terminal ID, APP = {APP1, APP2, APP3, …, APPN}, age group ID, gender, region, city grade, terminal type, consumption level, …}. Application programs include, but are not limited to, APPs and WeChat applets.
In this step, different user attribute weights are specified for different application programs. Each application program targets a relatively stable user group determined by user needs at the initial development stage, and for a stable user group each user attribute should take stable values over the whole group. For example, in the user group of one application program, male users account for 56% of all users and female users for 44%. On this basis, this embodiment determines a fixed user attribute weight for each application program. The weight may be determined empirically or computed from actual statistics on the user group. The user attribute weights in this embodiment include, but are not limited to, a gender weight and an age group weight. The gender weight is the proportion of male and of female users in the whole user group. The age group weight is the proportion of users in each age group in the whole user group. The user attribute weights specified for a particular application program are embodied in the following data structure: {application ID, application name, male user proportion, female user proportion, age group 1 user proportion, age group 2 user proportion, …}.
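The statistical way of determining the weights can be sketched as follows. This is a minimal illustration assuming the weight of each attribute value is simply its share of the application program's labeled user group; the function name `attribute_weights` and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def attribute_weights(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of each attribute value within one application's user group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Reproduces the 56% / 44% example from the text with 100 hypothetical users.
gender_labels = ["male"] * 56 + ["female"] * 44
gender_weight = attribute_weights(gender_labels)

# The same function applies to age group labels such as "AGE1" ... "AGE6".
age_weight = attribute_weights(["AGE2", "AGE2", "AGE3", "AGE5"])
```

The resulting dictionaries correspond to the proportion fields of the per-application data structure described above.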
Step S2: and acquiring a plurality of sample data in the user characteristic database, and marking the sample data respectively by taking the age group characteristics and the gender characteristics as labels.
Labeling with the age data and the gender data as labels further includes segmenting the age data into several categories, preferably 6: 0-17 years old, 18-24 years old, 25-34 years old, 35-44 years old, 45-55 years old, and over 55 years old. The gender data is likewise classified into two categories: male and female. The age segments are shown in Table 1 below.
Age group ID    Description
AGE1            0-17 years old
AGE2            18-24 years old
AGE3            25-34 years old
AGE4            35-44 years old
AGE5            45-55 years old
AGE6            Over 55 years old
TABLE 1
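The segmentation in Table 1 can be expressed as a small lookup. The sketch below assumes ages are whole years and that the upper bounds are inclusive as listed in the table; the function name `age_group_id` is hypothetical.

```python
def age_group_id(age: int) -> str:
    """Map an age in whole years to the age group ID of Table 1."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Inclusive upper bound of each segment, in ascending order.
    upper_bounds = [
        (17, "AGE1"),
        (24, "AGE2"),
        (34, "AGE3"),
        (44, "AGE4"),
        (55, "AGE5"),
    ]
    for upper, group in upper_bounds:
        if age <= upper:
            return group
    return "AGE6"  # over 55 years old
```

For example, an age of 55 falls in AGE5 and an age of 56 falls in AGE6.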
In this step, during sample data preprocessing, the age feature and the gender feature of each sample are extracted; the age feature is labeled with its age group category and the gender feature is labeled as male or female. Each sample may have the data structure {user terminal ID, AGEID, gender, APPID, …}. Since each sample contains both the age group feature and the gender feature, the application programs corresponding to each can be extracted separately, so that each labeled sample yields the two data structures {user terminal ID, AGEID, APP1, APP2, APP3, …, APPN} and {user terminal ID, gender, APP1, APP2, APP3, …, APPN}.
Step S3: Based on the age group characteristic, perform one-hot coding on the application list in the labeled data to obtain the age-group-based feature matrix; based on the gender characteristic, perform one-hot coding on the application list in the labeled data to obtain the gender-based feature matrix. The age-group-based feature matrix and/or the gender-based feature matrix is used as the feature input of the subsequent XGBoost algorithm, thereby forming a certain amount of sample data for predicting the characteristic attributes of the user terminal.
Before step S3, after the labeled data is received, the top N applications, ranked by how often they occur in the application lists of the labeled data, are screened out, where N is greater than or equal to 3000, and the user terminals on which 5 or more applications are installed are recorded as the user terminals on which one-hot coding can be performed.
The user terminals are counted and combined according to the data structure {application ID, user terminal ID} in order to screen out the top N applications; the structure is not limited to this and may also include an application ID and an application name.
User terminals on which 5 or more applications are installed are recorded. The threshold is not limited to 5; in this embodiment, after experimental testing, the parameter is preferably 5, and when at least 5 applications are installed on a user terminal, the gender and age of its user can be predicted. A user terminal for which prediction is performed may have the data structure {user terminal ID, application 1, application 2, application 3, …}.
Further, in step S3, the method for obtaining the age-based feature matrix includes:
S31: Perform one-hot coding on the applications of each user terminal, coding an installed application as 1 and an uninstalled application as 0, to obtain the feature matrix (1) of the applications installed on all user terminals in the labeled data:
Figure BDA0002995074930000121
S32: Substitute the age group characteristics into matrix (1) to obtain feature matrix (2), expressed as follows:
Figure BDA0002995074930000131
S33: Substitute the age-group-based user attribute weight of each application into matrix (2) to obtain the age-group-based feature matrix (3), expressed as follows:
Figure BDA0002995074930000132
For example, the user of user terminal 1 is labeled with age 26, which belongs to AGE3 (25-34 years old). With the applications in the list denoted application 1, application 2, …, application N, the application codes for this terminal are {1, 1, 0, …, 0}, which may further be expressed as {user1, 1, 1, 0, …, 0}. In the feature matrix of the applications for the corresponding age group, an installed application is coded as 1 and an uninstalled application as 0. The applications of the other user terminals in all age groups are coded in the same way. In the preset feature matrix, the weight of each application is 1 by default and is adjusted according to the specific scenario. If the weight of application 1 in the application list is 0.5, the corresponding code is multiplied by 0.5, and likewise for the other applications. The feature matrix for the corresponding age group is as follows:
Figure BDA0002995074930000133
Further, as follows:
Figure BDA0002995074930000141
In this step, if the weight of APP1 in the feature matrix is 0.5, the feature matrix becomes:
Figure BDA0002995074930000142
Further conversion yields:
Figure BDA0002995074930000143
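The coding and weighting described above can be sketched in plain Python. The function names and the sample terminals are hypothetical; the sketch only assumes what the text states, namely that an installed application is coded 1, an uninstalled one 0, and each code is multiplied by the application's weight (1 by default).

```python
from typing import Dict, List, Optional


def encode_row(installed: List[str], app_list: List[str],
               weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Code one terminal: weight (default 1) if installed, 0 otherwise."""
    weights = weights or {}
    installed_set = set(installed)
    return [weights.get(app, 1.0) if app in installed_set else 0.0
            for app in app_list]


def feature_matrix(terminals: Dict[str, List[str]], app_list: List[str],
                   weights: Optional[Dict[str, float]] = None) -> List[List[float]]:
    """Build one row per terminal, with columns in the order of app_list."""
    return [encode_row(apps, app_list, weights) for apps in terminals.values()]


app_list = ["APP1", "APP2", "APP3", "APP4"]
terminals = {"user1": ["APP1", "APP2"], "user2": ["APP3"]}

plain = feature_matrix(terminals, app_list)
weighted = feature_matrix(terminals, app_list, {"APP1": 0.5})
```

Here `plain` holds the unweighted codes, and in `weighted` only the APP1 column is scaled by 0.5, matching the example in the text.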
In step S3, the method of obtaining the gender-based feature matrix includes:
S34: Substitute the gender characteristics into matrix (1) to obtain feature matrix (4), expressed as follows:
Figure BDA0002995074930000144
S35: Substitute the gender-based user attribute weight of each application into matrix (4) to obtain the gender-based feature matrix (5), expressed as follows:
Figure BDA0002995074930000145
Further, based on the gender categories, one-hot coding is performed on the application list in the labeled data to obtain the feature matrix of the gender category. For example, if the terminal ID of user A is 1, the terminal is labeled male, and its applications are application 1, application 2, application 3, …, then in the feature matrix corresponding to the applications of that gender category an installed application is coded as 1 and an uninstalled application as 0. In the feature matrix of the gender category, the weight of each application is 1 by default and is adjusted according to the specific scenario. The feature matrix corresponding to the gender category is as follows:
Figure BDA0002995074930000151
Step S4: Train a model on the obtained sample data using the XGBoost algorithm, adapted to improve its applicability. The age-group feature matrix and the gender feature matrix are trained with the XGBoost algorithm, and the age-group and gender prediction models are obtained once training has matured.
Further, in step S4, the method of training the age-group feature matrix and the gender feature matrix with the XGBoost algorithm to obtain the age-group and gender prediction models includes:
S41: Receive a labeled sample data set containing n user terminals, each with m feature values. The sample data set is expressed as: $D = \{(x_i, y_i)\}$ $(|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$.
S42: Train the XGBoost algorithm on the age-group feature matrix and/or the gender feature matrix. Training yields K CART decision trees, $f_k(x) = \omega_{q(x)}$, which are added together to form an ensemble model used as the prediction model for age group and gender. The output of the model represents the predicted age group or gender, and the prediction model is expressed as:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \Gamma$$
where $\hat{y}_i$ represents the output of the XGBoost model and $\Gamma$ is the regression tree space, i.e. the set of CART decision trees, defined as: $\Gamma = \{ f(x) = \omega_{q(x)} \}$ $(q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$.
Here, f(x) represents a CART decision tree consisting of a tree structure q and T leaf nodes. q maps each sample to the index of the corresponding leaf node, x represents a user terminal, and q(x) represents the leaf node into which that user terminal falls. Each leaf node carries a continuous value, called the weight ω of the leaf node. $\omega_{q(x)}$ is the value that the regression tree outputs for the sample, i.e. the predicted value for that user terminal. Each $f_k$ corresponds to an independent tree structure q and its leaf node weights ω, and all the weights of a tree form its weight vector $\omega \in \mathbb{R}^T$.
The tree structure q splits on the characteristic attributes of the user terminal and maps any sample with m-dimensional features to one of the leaf nodes. For one sample, the XGBoost model obtains the final predicted value $\hat{y}_i$ as follows: the sample is mapped to the corresponding leaf node on each decision tree, and the weights of the K leaf nodes corresponding to the sample are then added.
The XGBoost algorithm uses the following objective function, in which the first sum runs over the n samples and the second over all trained trees:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$
The XGBoost objective function measures the deviation between the model's predicted values and the true values, and training makes the objective function value as small as possible. $\hat{y}_i$ denotes the predicted value and $y_i$ the target value; $l$ is the training loss function, which measures the deviation between the predicted value $\hat{y}_i$ and the target value $y_i$; $\Omega$ is the regularization term, which controls the complexity of the trained model and is defined as:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2$$
Ω represents the model complexity penalty term, γ and λ are the penalty coefficients, and T is the number of leaf nodes; $\lVert \omega \rVert^2$ is the sum of squares of the output scores at the leaf nodes of each tree, which amounts to L2 regularization. The penalty term in the XGBoost model thus reflects model complexity through both the number of leaf nodes of each tree and the sum of squares of the leaf scores, and it lends itself to parallelized optimization.
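To make the penalty term concrete, the sketch below evaluates it for a single tree. It assumes the standard XGBoost form Ω(f) = γT + ½λ‖ω‖², which is how the regularization term is usually written; the function name is hypothetical and the leaf weights in any example are illustrative.

```python
from typing import Sequence


def tree_penalty(leaf_weights: Sequence[float], gamma: float, lam: float) -> float:
    """Complexity penalty of one tree: gamma * T + 0.5 * lam * sum of squared leaf weights."""
    num_leaves = len(leaf_weights)                   # T, the number of leaf nodes
    squared_norm = sum(w * w for w in leaf_weights)  # squared L2 norm of omega
    return gamma * num_leaves + 0.5 * lam * squared_norm
```

With γ = 1 and λ = 1, a tree with leaf weights 2 and -1 has penalty 2 + 0.5 × 5 = 4.5; adding leaves or enlarging the leaf scores increases the penalty, which is how the term discourages overly complex trees.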
Further, the age-group feature matrix is input into a preset XGBoost model, and the age-group prediction model is obtained through training. When the age-group prediction model is trained on the feature data labeled with age groups, it outputs the age group. The output age classification result comprises 6 types, and the classification result is obtained from the probability values of the age groups output by the model: the age group with the largest output probability is taken as the predicted age group type.
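The selection rule above amounts to taking the arg max over the six class probabilities. A minimal sketch, assuming the model returns one probability per age group in the order AGE1 to AGE6 (the function name and the ordering are assumptions):

```python
from typing import List, Sequence

AGE_GROUPS: List[str] = ["AGE1", "AGE2", "AGE3", "AGE4", "AGE5", "AGE6"]


def predict_age_group(probabilities: Sequence[float]) -> str:
    """Return the age group whose predicted probability is largest."""
    if len(probabilities) != len(AGE_GROUPS):
        raise ValueError("expected one probability per age group")
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return AGE_GROUPS[best_index]
```

If two groups tie for the largest probability, this sketch returns the first of them; the text does not say how ties are handled.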
In this step, the gender feature matrix is trained with the XGBoost algorithm to obtain the gender prediction model. When the gender prediction model is trained on the feature data labeled with gender, it outputs the gender category. When the output probability value is greater than 0.4593821, the output gender category is female; when it is less than this value, the gender category is male; any other output is treated as unknown.
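The thresholding rule above can be sketched as follows. The threshold value comes from the text; the function name is hypothetical, and treating a missing value, NaN, a value outside [0, 1], or a value exactly equal to the threshold as "unknown" is an assumption about what "other outputs" means.

```python
import math
from typing import Optional

FEMALE_THRESHOLD = 0.4593821  # threshold stated in the description


def predict_gender(probability: Optional[float]) -> str:
    """Map the model's output probability to 'female', 'male' or 'unknown'."""
    if probability is None or math.isnan(probability):
        return "unknown"
    if not 0.0 <= probability <= 1.0:
        return "unknown"
    if probability > FEMALE_THRESHOLD:
        return "female"
    if probability < FEMALE_THRESHOLD:
        return "male"
    return "unknown"
```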
How the XGBoost model is trained to obtain the predicted value is further described as follows:
The XGBoost model is an efficient implementation of the gradient boosting decision tree (GBDT) and can also be understood as a scalable tree boosting machine learning system. In the XGBoost model, each tree is grown by repeatedly splitting on features; one tree is learned in each round, fitting the residual between the predicted value of the previous round's model and the true value. When training has produced all the trees, the score of a sample is predicted as follows: according to its features, the sample falls into one leaf node in each tree, each leaf node corresponds to a score, and the scores from all the trees are added up to give the predicted value of the sample.
As a further example, assume in this embodiment that labeled data is received for several users in a family and that the user terminals of these users form the sample data set. The XGBoost model is used to predict how much each family member likes electronic games. The considerations are that younger users like electronic games more than older users, and that males like electronic games more than females. The trees therefore first distinguish adults from children by age and then distinguish males from females by gender, and the preference of each individual for electronic games is scored one by one.
Each family member in the labeled user sample set is classified by the decision rules in the tree models. Training yields tree model 1 and tree model 2, and the scores at the corresponding leaf nodes are added to obtain the final prediction result. The predicted score for a child is the sum of the scores at its leaf nodes in the two tree models: 2 + 0.9 = 2.9. In the same way, the predicted score for an older adult is: -1 + (-0.9) = -1.9. The child's predicted preference for electronic games is therefore 2.9 and the older adult's is -1.9, indicating that the child prefers to play electronic games.
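The additive scoring in this example can be sketched with two hand-written trees. Only the leaf scores 2, 0.9, -1 and -0.9 come from the text. Everything else is assumed for illustration: the age cut-off of 18, the score 0.1 for a female child, and the `plays_daily` feature used to split tree 2, whose split is not described in the text.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]
Tree = Callable[[Sample], float]


def tree_1(sample: Sample) -> float:
    """Split by age first, then by gender (structure assumed)."""
    if sample["age"] < 18:
        # 2.0 is from the text; 0.1 is a made-up score for the other leaf.
        return 2.0 if sample["gender"] == "male" else 0.1
    return -1.0


def tree_2(sample: Sample) -> float:
    """Split on a hypothetical feature; only the leaf scores are from the text."""
    return 0.9 if sample["plays_daily"] else -0.9


def predict(sample: Sample, trees: List[Tree]) -> float:
    """Sum the leaf scores that the sample reaches in every tree."""
    return sum(tree(sample) for tree in trees)


child = {"age": 10, "gender": "male", "plays_daily": True}
older_adult = {"age": 70, "gender": "male", "plays_daily": False}

child_score = predict(child, [tree_1, tree_2])        # 2 + 0.9
older_score = predict(older_adult, [tree_1, tree_2])  # -1 + (-0.9)
```

The same summation applies to any number of trees, which is why the prediction model is written as a sum over all trained trees.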
The regularization term is further described as follows:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2$$
The first term in the formula controls the number of leaf nodes in the tree model; the second term controls the weight distribution of the leaf nodes; the γ and λ parameters adjust the balance between the two components of the regularization term.
According to the defined XGBoost objective function, the XGBoost algorithm is trained on the training sample set. In the XGBoost algorithm, training proceeds by iteratively adding tree models: at each step of training, one CART decision tree function is added so that the loss function is further reduced. The XGBoost model is thus trained additively, and techniques such as second-order Taylor expansion, column subsampling and split finding are further used to optimize the model, improving training speed and precision.
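For reference, the second-order Taylor expansion mentioned above is usually written as below. This is the standard XGBoost derivation rather than a formula stated in this document; the symbols $g_i$, $h_i$, $f_t$ and $I_j$ are introduced here for illustration, and the regularization term is assumed to have the form $\Omega(f) = \gamma T + \frac{1}{2}\lambda\lVert\omega\rVert^2$.

```latex
% Objective at boosting round t, expanded to second order around the
% prediction of the previous round, \hat{y}_i^{(t-1)}:
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(\hat{y}_i^{(t-1)}, y_i\big)
    + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t)

% First and second derivatives of the loss with respect to the prediction:
g_i = \frac{\partial\, l\big(\hat{y}_i^{(t-1)}, y_i\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\big(\hat{y}_i^{(t-1)}, y_i\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^{2}}

% Optimal weight of leaf j, where I_j = \{ i : q(x_i) = j \}:
\omega_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
```

Because the expansion depends on the loss only through $g_i$ and $h_i$, the same tree-building procedure can be reused for different loss functions.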
Furthermore, in this step, the lists of applications installed on user terminals labeled with age group and gender are used as existing data for training, yielding the age-group prediction model and the gender prediction model respectively. When such data is later obtained for a user whose age and gender are unknown, the age group and gender can be predicted with the respective models, providing a good basis for an accurate user portrait.
Step S5: according to the age group prediction model and the gender prediction model, the age group and the gender of a user are predicted for an application program installed in any user terminal, the age group and the gender of the user are obtained, meanwhile, a multi-dimensional interest tag of the user is formed on the basis of user characteristic attributes including regions, city grades, terminal types and consumption levels, and a user portrait is constructed and generated so as to be convenient for accurate recommendation of the user.
In this step, the user is portrayed along multiple dimensions using data of multiple characteristic types, by a multi-dimensional or fused portrait construction method. The predicted age group and gender are analyzed together with the user's region, city grade, device model and user behavior such as consumption level to generate the user portrait, which provides a good data basis for precision marketing, personalized service, online public opinion governance and other uses. Combining the idea of the boosting tree algorithm makes it possible to predict age group and gender with improved accuracy, so that the user portrait is generated more accurately.
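How the predicted attributes and the known terminal attributes might be combined into one portrait record can be sketched as follows. The function name, the dictionary layout and the "key:value" tag format are all assumptions; the text only states which attributes go into the portrait.

```python
from typing import Any, Dict, List


def build_portrait(terminal_id: str, age_group: str, gender: str,
                   attributes: Dict[str, str]) -> Dict[str, Any]:
    """Merge predicted age group and gender with known attributes into one record."""
    # Known attributes such as region, city grade, terminal type, consumption level.
    tags: List[str] = [f"{key}:{value}" for key, value in sorted(attributes.items())]
    # Predicted attributes are appended as tags of the same form.
    tags += [f"age_group:{age_group}", f"gender:{gender}"]
    return {
        "terminal_id": terminal_id,
        "age_group": age_group,
        "gender": gender,
        "attributes": dict(attributes),
        "tags": tags,
    }
```

A downstream recommender could then match content against the `tags` list, although the text does not specify how recommendation uses the portrait.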
Example two
Referring to fig. 3, an embodiment of the present application provides a user representation system based on a predictable user characteristic attribute, to which the method in the first embodiment is applied, including:
the database module 100 is configured to construct an internet user terminal characteristic database, acquire application programs installed by user terminals, assign different user attribute weights to different application programs, perform category refinement on reported user characteristic attributes in the application programs, automatically merge the user characteristic attributes according to tags, and unify the categories;
the preprocessing module 200 is configured to acquire a plurality of sample data in the user characteristic database, and label the sample data respectively by using the age stage characteristic and the gender characteristic as labels;
the feature coding module 300 is configured to perform one-hot coding on the application program list in the labeling data based on the age group features to obtain a feature matrix based on the age group; based on the gender characteristics, one-hot coding is carried out on the application program list in the marking data, and a characteristic matrix based on the gender is obtained;
the model training module 400 is configured to input the feature matrix of the age group into a preset XGboost model, obtain an age group prediction model after training, input the feature matrix of the gender into the preset XGboost model, and obtain a gender prediction model after training;
the portrait generation module 500 is configured to predict the age group and gender of a user from the applications installed on any user terminal according to the age-group prediction model and the gender prediction model, form multi-dimensional interest tags for the user based on user characteristic attributes including region, city grade, terminal type and consumption level, and generate a user portrait to enable accurate recommendations to the user.
Example three
An embodiment of the present application provides an electronic device, including: one or more memories for storing executable program code; and a processor coupled to the memory for storing one or more programs. The processor reads the executable program code and runs the computer program corresponding to the executable program code to perform the user portrait method based on predictable user characteristic attributes described in any one of the embodiments.
The embodiment of the application provides a storage medium, which stores executable program codes, and at least one processor reads the executable program codes to run a computer program corresponding to the executable program codes so as to execute the user portrayal method based on the attribute of the predictable user characteristic according to any one of the embodiments.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.

Claims (8)

1. A user representation method based on predictable user characteristic attributes, the method comprising:
s1: constructing a feature database of an internet user terminal, acquiring a plurality of application programs installed by each user terminal, assigning different user attribute weights for different application programs, performing category refinement on user feature attributes including gender and age reported in each application program, then automatically merging according to a preset label, and unifying the categories;
s2: acquiring a plurality of sample data in a user characteristic database, and marking the sample data respectively by taking the age group characteristics and the gender characteristics as labels;
s3: based on the age group characteristics, performing one-hot coding on the application program list in the marked data to obtain a characteristic matrix based on the age group; based on the gender characteristics, performing one-hot coding on the application program list in the labeling data to obtain a gender-based characteristic matrix; taking the feature matrix based on the age group and/or the feature matrix based on the gender as the feature input of a subsequent XGboost algorithm, thereby forming a certain amount of sample data for predicting the feature attribute of the user terminal;
s4: an XGboost algorithm is adopted, the applicability is improved, and model training is carried out on the obtained sample data; training a feature matrix of the age group and a feature matrix of the gender through an XGboost algorithm, and obtaining a prediction model of the age group and the gender after training and maturation;
s5: according to the prediction model of the age group and the gender, the age group and the gender of the user are predicted for the application program installed in any user terminal, the age group and the gender of the user are obtained, meanwhile, a multi-dimensional interest tag of the user is formed on the basis of the characteristic attributes of the user including one or more of the region, the city grade, the terminal type and the consumption level, and a user portrait is constructed and generated so as to accurately recommend the user.
2. The method for representing a user image based on characteristics of predictable users as claimed in claim 1, wherein in step S3, the method for obtaining the characteristic matrix based on age bracket comprises:
s31: performing one-hot coding on the application program of each user terminal, and acquiring a feature matrix (1) of the application programs installed on all the user terminals in the labeling data for the installed application program code 1 and the uninstalled application program code 0:
Figure FDA0002995074920000021
s32: the age group characteristics are substituted into the matrix (1) to obtain a characteristic matrix (2) which is expressed as follows:
Figure FDA0002995074920000022
s33: and substituting the user attribute weight based on the age group characteristic of each application program into the matrix (2) to obtain a characteristic matrix (3) based on the age group, wherein the characteristic matrix is represented as follows:
Figure FDA0002995074920000023
3. the method for user representation based on predictable user characteristic attributes according to claim 2, wherein the step S3, the method for obtaining the gender-based characteristic matrix comprises:
s34: the gender characteristics are substituted into the matrix (1) to obtain a characteristic matrix (4) which is expressed as follows:
Figure FDA0002995074920000031
s35: and substituting the user attribute weight based on the gender characteristic of each application program into the matrix (4) to obtain a gender-based characteristic matrix (5), which is expressed as follows:
Figure FDA0002995074920000032
4. the method for user representation based on predictable user characteristics as claimed in claim 1, wherein the step S4 of training the feature matrix of the age group and the feature matrix of the gender through the XGBoost algorithm to obtain the prediction model of the age group and the gender comprises:
S41: receiving a labeled sample data set containing n user terminals, each with m feature values, the sample data set being expressed as: $D = \{(x_i, y_i)\}$ $(|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$;
S42: training the XGBoost algorithm on the age-group feature matrix and/or the gender feature matrix, the training yielding K CART decision trees, $f_k(x) = \omega_{q(x)}$, which are added together to form an ensemble model used as the prediction model for age group and gender, wherein the output of the model represents the predicted age group or gender and the prediction model is expressed as:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \Gamma$$
wherein $\hat{y}_i$ represents the output of the XGBoost model and $\Gamma$ is the regression tree space, i.e. the set of CART decision trees, defined as: $\Gamma = \{ f(x) = \omega_{q(x)} \}$ $(q: \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$;
wherein f(x) represents a CART decision tree consisting of a tree structure q and T leaf nodes, q maps each sample to the index of the corresponding leaf node, x represents a user terminal, q(x) represents the leaf node into which that user terminal falls, each leaf node carries a continuous value called the weight ω of the leaf node, $\omega_{q(x)}$ is the value that the regression tree outputs for the sample, i.e. the predicted value for that user terminal, each $f_k$ corresponds to an independent tree structure q and its leaf node weights ω, and all the weights of a tree form its weight vector $\omega \in \mathbb{R}^T$;
the tree structure q splits on the characteristic attributes of the user terminal and maps any sample with m-dimensional features to one of the leaf nodes; for one sample, the XGBoost model obtains the final predicted value $\hat{y}_i$ by mapping the sample to the corresponding leaf node on each decision tree and then adding the weights of the K leaf nodes corresponding to the sample;
the XGBoost algorithm uses the following objective function, in which the first sum runs over the n samples and the second over all trained trees:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$
the XGBoost objective function measures the deviation between the model's predicted values and the true values, and training makes the objective function value as small as possible; $\hat{y}_i$ denotes the predicted value and $y_i$ the target value; $l$ is the training loss function, which measures the deviation between the predicted value $\hat{y}_i$ and the target value $y_i$; $\Omega$ is the regularization term, which controls the complexity of the trained model and is defined as:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2$$
wherein Ω represents the model complexity penalty term, γ and λ are the penalty coefficients, and T is the number of leaf nodes; $\lVert \omega \rVert^2$ is the sum of squares of the output scores at the leaf nodes of each tree, which amounts to L2 regularization.
5. The method as claimed in claim 1, wherein before step S3, the method further comprises, after receiving the labeled data, screening out the top N applications, ranked by how often they occur in the application lists of the labeled data, where N ≥ 3000, and recording the user terminals on which 5 or more applications are installed as the user terminals on which one-hot coding can be performed.
6. A system for applying the user portrayal approach of any one of claims 1-5 based on predictable user characteristic attributes, comprising:
the database module is configured to construct a characteristic database of the Internet user terminal, acquire application programs installed by each user terminal, assign different user attribute weights for different application programs, perform category refinement on the reported user characteristic attributes including gender and age in each application program, automatically merge the user characteristic attributes according to tags and unify the categories;
the preprocessing module is configured to acquire a plurality of sample data in the user characteristic database, and label the sample data respectively by taking the age group characteristics and the gender characteristics as labels;
the characteristic coding module is configured to perform one-hot coding on the application program list in the labeling data based on the age group characteristics to acquire a characteristic matrix based on the age group; based on the gender characteristics, performing one-hot coding on the application program list in the labeling data to obtain a gender-based characteristic matrix; taking the feature matrix based on the age group and/or the feature matrix based on the gender as the feature input of a subsequent XGboost algorithm, thereby forming a certain amount of sample data for predicting the feature attribute of the user terminal;
the model training module is configured to adopt an XGboost algorithm, improve the applicability and train the model of the acquired sample data; inputting the characteristic matrix of the age group and the characteristic matrix of the gender into a preset XGboost model, and obtaining a prediction model of the age group and the gender after training and maturation;
and the portrait generation module is configured to predict the age and gender of the user according to the age prediction model and the gender prediction model of the application program installed in any user terminal, obtain the age and the gender of the user, form a multi-dimensional interest tag of the user based on the user characteristic attributes including one or more of region, city grade, terminal type and consumption level, and construct and generate a portrait of the user so as to accurately recommend the user.
7. An electronic device, comprising:
one or more memories for storing executable program code;
a processor coupled to the one or more memories;
wherein the processor reads the executable program code and executes a computer program corresponding to the executable program code to perform the user portrayal method based on predictable user characteristic attributes as claimed in any one of claims 1 to 5.
8. A storage medium having stored thereon executable program code, which is read by at least one processor to execute a computer program corresponding to the executable program code to perform the user portrayal method based on predictable user characteristic attributes according to any one of claims 1 to 5.
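For reference, the XGBoost algorithm named in the claims fits an additive ensemble of regression trees by minimizing a regularized objective. The standard formulation is reproduced below as background only. The symbols are the usual ones from the XGBoost literature and do not appear in the patent text: x_i is assumed to be the one-hot app feature row of sample i, and y_i its age group or gender label.

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}
```

Here l is a differentiable loss (for example a logistic loss for a two-class gender label or a softmax loss for a multi-class age group label), each f_k is a tree with T leaves and leaf weight vector w, and γ and λ are regularization hyperparameters.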
CN202110327165.XA 2021-03-12 2021-03-26 User portrayal method and system based on predictable user characteristic attributes Pending CN114358807A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021102713380 2021-03-12
CN202110271338 2021-03-12

Publications (1)

Publication Number Publication Date
CN114358807A true CN114358807A (en) 2022-04-15

Family

ID=81095974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110327165.XA Pending CN114358807A (en) 2021-03-12 2021-03-26 User portrayal method and system based on predictable user characteristic attributes

Country Status (1)

Country Link
CN (1) CN114358807A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081334A (en) * 2022-06-30 2022-09-20 支付宝(杭州)信息技术有限公司 Method, system, apparatus and medium for predicting age bracket or gender of user
CN115689626A (en) * 2022-10-31 2023-02-03 荣耀终端有限公司 User attribute determination method of terminal equipment and electronic equipment
CN115689626B (en) * 2022-10-31 2024-03-01 荣耀终端有限公司 User attribute determining method of terminal equipment and electronic equipment

Similar Documents

Publication Publication Date Title
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
Ni et al. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks
WO2020228514A1 (en) Content recommendation method and apparatus, and device and storage medium
WO2021203819A1 (en) Content recommendation method and apparatus, electronic device, and storage medium
CN111898031B (en) Method and device for obtaining user portrait
CN111291266A (en) Artificial intelligence based recommendation method and device, electronic equipment and storage medium
CN111966914B (en) Content recommendation method and device based on artificial intelligence and computer equipment
US11188830B2 (en) Method and system for user profiling for content recommendation
CN104268292B (en) The label Word library updating method of portrait system
EP4181026A1 (en) Recommendation model training method and apparatus, recommendation method and apparatus, and computer-readable medium
Piletskiy et al. Development and analysis of intelligent recommendation system using machine learning approach
Huang et al. Neural embedding collaborative filtering for recommender systems
CN113569129A (en) Click rate prediction model processing method, content recommendation method, device and equipment
CN114358807A (en) User portrayal method and system based on predictable user characteristic attributes
CN111429161B (en) Feature extraction method, feature extraction device, storage medium and electronic equipment
Zhang et al. SEMA: Deeply learning semantic meanings and temporal dynamics for recommendations
Yuen et al. An online-updating algorithm on probabilistic matrix factorization with active learning for task recommendation in crowdsourcing systems
US20210350202A1 (en) Methods and systems of automatic creation of user personas
CN116452263A (en) Information recommendation method, device, equipment, storage medium and program product
CN116823410B (en) Data processing method, object processing method, recommending method and computing device
US20230316106A1 (en) Method and apparatus for training content recommendation model, device, and storage medium
Obadić et al. Addressing item-cold start problem in recommendation systems using model based approach and deep learning
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
Zheng et al. Personality-aware collaborative learning: models and explanations
CN113724044A (en) User portrait based commodity recommendation, apparatus, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination