CN105701498B - User classification method and server - Google Patents
- Publication number
- CN105701498B (CN201511033392.2A)
- Authority
- CN
- China
- Prior art keywords
- user
- users
- attribute
- initial
- labeling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a user classification method and a server. The method comprises: acquiring at least one labeled user having a first attribute based on historical service data of social network users; acquiring, from at least one dimension, at least one feature parameter corresponding to the labeled user, and determining a classification model for the first attribute based on the feature parameter of the labeled user and the first attribute corresponding to the labeled user; and classifying, based on the classification model for the first attribute, at least one target user in the social network into a corresponding category of the first attribute.
Description
Technical Field
The present invention relates to a user information processing technology in the field of communications, and in particular, to a user classification method and a server.
Background
In current social networks and media information delivery systems, media information is classified and delivered by directly using attributes, such as relationship status, that registered users fill in themselves. However, user-filled attributes have two problems. First, user coverage is incomplete: a user may never fill in the attribute. Second, the content is inaccurate: the attribute may be out of date because it is not updated in time. Classification based on user-filled attributes in current social networks can therefore be inaccurate.
Disclosure of Invention
In view of the above, the present invention provides a user classification method and a server that can solve at least the above problems in the prior art.
To achieve this purpose, the technical solution of the invention is implemented as follows:
An embodiment of the invention provides a user classification method, which comprises the following steps:
acquiring at least one labeled user having a first attribute based on historical service data of social network users, wherein the first attribute represents the relationship and marital status of the social network user;
acquiring, from at least one dimension, at least one feature parameter corresponding to the labeled user, and determining a classification model for the first attribute based on the feature parameter of the labeled user and the first attribute corresponding to the labeled user;
and classifying, based on the classification model for the first attribute, at least one target user in the social network into a corresponding category of the first attribute.
An embodiment of the present invention provides a server, including:
a user acquisition unit, configured to acquire at least one labeled user having a first attribute based on historical service data of social network users, wherein the first attribute represents the relationship and marital status of the social network user;
a model establishing unit, configured to acquire, from at least one dimension, at least one feature parameter corresponding to the labeled user, and to determine a classification model for the first attribute based on the feature parameter of the labeled user and the first attribute corresponding to the labeled user;
and a classification unit, configured to classify at least one target user in the social network into a corresponding category of the first attribute based on the classification model for the first attribute.
The embodiments of the invention provide a user classification method and a server, in which at least one labeled user having a first attribute is acquired based on historical service data, a classification model for the first attribute is determined based on at least one feature parameter from at least one dimension and the first attribute of the labeled user, and at least one target user is classified according to the classification model. This avoids the problem that target users cannot be classified accurately because they have not filled in the first attribute or the filled-in first attribute is out of date.
Drawings
FIG. 1 is a flowchart illustrating a user classification method according to an embodiment of the present invention;
FIG. 2 is a first schematic diagram of a scenario for selecting labeled users according to an embodiment of the present invention;
FIG. 3 is a second schematic diagram of a scenario for selecting labeled users according to an embodiment of the present invention;
FIG. 4 is a third schematic diagram of a scenario for selecting labeled users according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a user feature extraction scenario according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating feature extraction according to an embodiment of the present invention;
FIG. 7 is a logic diagram illustrating the establishment of a classification model according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a server component structure according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a hardware component structure of a server according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Embodiment 1
An embodiment of the present invention provides a user classification method. As shown in FIG. 1, the method includes:
step 101: acquiring at least one labeled user having a first attribute based on historical service data of social network users, wherein the first attribute represents the relationship and marital status of the social network user;
step 102: acquiring, from at least one dimension, at least one feature parameter corresponding to the labeled user, and determining a classification model for the first attribute based on the feature parameter of the labeled user and the first attribute corresponding to the labeled user;
step 103: classifying, based on the classification model for the first attribute, at least one target user in the social network into a corresponding category of the first attribute.
Here, the scheme provided by this embodiment may be applied on the server side.
The classification model for the first attribute takes the feature parameters of a user as input and outputs the category of the first attribute corresponding to that user.
Before step 101 is executed to acquire the at least one labeled user having the first attribute, the method further includes:
selecting, based on historical service data of the social network users, at least one first-type initial user whose first attribute is a first category; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute may be the marital status of the user; correspondingly, the first attribute may have two categories, where the first category may be married and the second category may be unmarried;
determining common features of the first-type initial users based on the historical service data of the first-type initial users;
selecting from the social network, based on the common features of the first-type initial users, at least one second-type initial user whose difference from those common features exceeds a preset threshold, and setting the first attribute of the second-type initial user to the second category;
and establishing a classification model for the first attribute based on the historical service data of the first-type initial users and the second-type initial users.
Selecting the at least one first-type initial user whose first attribute is the first category may include: selecting, according to the users' historical service data, the users whose first attribute is set to the first category as the first-type initial users. The first category is married and, correspondingly, the first-type initial users are married users. These users are selected first because the relationship and marital status that social network users fill in at registration is assumed to be accurate; the only problem is that some statuses have not been updated for a long time.
The at least one second-type initial user is selected from all users other than the at least one first-type initial user. As shown in FIG. 2, the first-type initial users are regarded as positive examples (positive data), and a preset proportion of second-type initial users is randomly selected as negative examples (negative data) from the users remaining after the first-type initial users are removed, that is, from the unlabeled data. A classification model for the first attribute is then established and trained using the first-type and second-type initial users as training data.
The preset proportion can be set according to actual conditions; for example, 30% of the remaining users may be selected as second-type initial users, or alternatively 50%.
The classification model for the first attribute may be a binary classifier that determines whether a user is married. The model is trained with the logistic regression (LR) machine learning algorithm, yielding an LR model.
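The initial training step described above can be sketched as follows. This is a minimal illustration only, not the patent's implementation: the two-dimensional feature vectors, the function names (`build_initial_training_set`, `train_lr`, `predict_proba`), the learning rate, and the epoch count are all assumptions made for the example. Self-reported married users serve as positives, a preset proportion of the remaining users is sampled at random as negatives, and a logistic regression model is fitted by plain gradient descent.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_initial_training_set(positives, unlabeled, proportion, seed=0):
    """Positives get label 1; a random share of the unlabeled users gets label 0."""
    rng = random.Random(seed)
    k = max(1, int(len(unlabeled) * proportion))
    negatives = rng.sample(unlabeled, k)
    X = positives + negatives
    y = [1] * len(positives) + [0] * len(negatives)
    return X, y


def train_lr(X, y, lr=0.5, epochs=500):
    """Fit weights (bias stored last) by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])
            err = p - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, x):
    """Estimated probability that a user with features x is married."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


# Hypothetical feature vectors (for example, two normalized behavior scores).
positives = [[1.0, 0.9], [0.9, 0.8], [0.8, 1.0]]
unlabeled = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]

X, y = build_initial_training_set(positives, unlabeled, proportion=0.5)
w = train_lr(X, y)
```

In practice a library implementation of logistic regression would normally replace the hand-written optimizer; the loop is spelled out here only to keep the sketch self-contained.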
Further, acquiring the at least one labeled user having the first attribute may include:
selecting, based on historical service data of the social network users, at least one user who has set the first attribute as a user to be processed;
classifying the user to be processed based on the classification model for the first attribute to obtain a classification result for the user to be processed;
and determining the probability that the first attribute of the user to be processed matches the corresponding classification result, and selecting the users to be processed whose probability is higher than a preset probability threshold as the labeled users.
The content set in the first attribute may be obtained from the user's tag. Among the users who have set the first attribute, the set content may take several values, which may include: married, unmarried, single, with children, newly married, in a relationship, engaged, separated, and divorced;
correspondingly, when determining the probability that the first attribute of the user to be processed matches the corresponding classification result, a category is first selected for the user to be processed according to the content set in that user's first attribute. For example, the contents that may correspond to the married category include: married, newly married, and with children; the contents corresponding to the unmarried category include: single, unmarried, in a relationship, engaged, separated, and divorced.
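The mapping from self-reported status content to the two categories can be sketched as a simple lookup. The English status strings and the name `status_to_label` are assumptions for illustration; the label convention (1 for married, 0 for unmarried) is chosen to match the conditions on `label` used later in the data acquisition step.

```python
# Assumed English renderings of the status values listed in the text.
MARRIED_STATUSES = {"married", "newly married", "with children"}
UNMARRIED_STATUSES = {
    "single", "unmarried", "in a relationship",
    "engaged", "separated", "divorced",
}


def status_to_label(status):
    """Return 1 for the married category, 0 for unmarried, None if unknown."""
    s = status.strip().lower()
    if s in MARRIED_STATUSES:
        return 1
    if s in UNMARRIED_STATUSES:
        return 0
    return None
```

Returning `None` for unrecognized content lets the caller skip users whose status cannot be mapped, rather than forcing them into either category.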
Building on FIG. 2, the data acquisition process described above is explained with reference to FIG. 3. Specifically, the classification model scores every user in the social network who has filled in a relationship status, estimating the probability p(c|instance) that the user belongs to the married population, and the data meeting the following conditions are retained as the candidate training data set for multi-class classification:
p(c=0|instance,label=0)>threshold1
p(c=1|instance,label=1)>threshold2
where c is the category estimated by the classification model for the first attribute, that is, whether the user is judged to be married based on at least one second attribute of the user and the classification model; instance is the user to be processed; and label is the category labeled for the instance (for example, "married"). threshold denotes a cutoff threshold: threshold1 retains users predicted with high probability to be unmarried, and threshold2 retains users predicted with high probability to be married.
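The two retention conditions can be sketched as a filter over scored candidates. The tuple layout `(user_id, label, p_married)` and the function name `select_labeled_users` are assumptions for this example; `p_married` stands for p(c=1|instance), so p(c=0|instance) is taken as `1 - p_married`.

```python
def select_labeled_users(candidates, threshold1, threshold2):
    """Keep users whose self-reported label agrees with a confident prediction.

    candidates: iterable of (user_id, label, p_married), where label is
    0 (unmarried) or 1 (married) and p_married is the model's estimate.
    """
    kept = []
    for user_id, label, p_married in candidates:
        if label == 0 and (1.0 - p_married) > threshold1:
            kept.append((user_id, label))
        elif label == 1 and p_married > threshold2:
            kept.append((user_id, label))
    return kept


# Hypothetical scored users.
candidates = [
    ("u1", 1, 0.95),  # labeled married, confidently predicted married
    ("u2", 1, 0.40),  # labeled married, but the model disagrees
    ("u3", 0, 0.05),  # labeled unmarried, confidently predicted unmarried
    ("u4", 0, 0.60),  # labeled unmarried, but the model disagrees
]
labeled = select_labeled_users(candidates, threshold1=0.8, threshold2=0.9)
```

Using separate thresholds for the two classes allows the cutoff to be tuned independently when one class is much more common than the other.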
With this scheme, at least one labeled user having the first attribute can be acquired based on historical service data, a classification model for the first attribute can be determined based on at least one feature parameter from at least one dimension and the first attribute of the labeled user, and at least one target user can be classified according to the classification model. This avoids the problem that target users cannot be classified accurately because they have not filled in the first attribute or the filled-in first attribute is out of date.
Embodiment 2
An embodiment of the present invention provides a user classification method. As shown in FIG. 1, the method includes:
step 101: acquiring at least one labeled user having a first attribute based on historical service data of social network users, wherein the first attribute represents the relationship and marital status of the social network user;
step 102: acquiring, from at least one dimension, at least one feature parameter corresponding to the labeled user, and determining a classification model for the first attribute based on the feature parameter of the labeled user and the first attribute corresponding to the labeled user;
step 103: classifying, based on the classification model for the first attribute, at least one target user in the social network into a corresponding category of the first attribute.
Here, the scheme provided by this embodiment may be applied on the server side.
The classification model for the first attribute takes the feature parameters of a user as input and outputs the category of the first attribute corresponding to that user.
Before step 101 is executed to acquire the at least one labeled user having the first attribute, the method further includes:
selecting, based on historical service data of the social network users, at least one first-type initial user whose first attribute is a first category; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute may be the marital status of the user; correspondingly, the first attribute may have two categories, where the first category may be married and the second category may be unmarried;
determining common features of the first-type initial users based on the historical service data of the first-type initial users;
selecting from the social network, based on the common features of the first-type initial users, at least one second-type initial user whose difference from those common features exceeds a preset threshold;
and establishing a classification model for the first attribute based on the historical service data of the first-type initial users and the second-type initial users.
Selecting the at least one first-type initial user whose first attribute is the first category may include: selecting, according to the users' historical service data, the users whose first attribute is set to the first category as the first-type initial users. The first category is married and, correspondingly, the first-type initial users are married users. These users are selected first because the relationship and marital status that social network users fill in at registration is assumed to be accurate; the only problem is that some statuses have not been updated for a long time.
On the basis of the above operations, this embodiment further provides that selecting at least one second-type initial user from all users other than the at least one first-type initial user includes:
determining common features of the first-type initial users based on the historical service data of the first-type initial users;
and selecting from the social network, based on the common features of the first-type initial users, at least one second-type initial user whose difference from those common features exceeds a preset threshold.
When negative examples are selected purely at random, the unlabeled data may contain users who are actually positive but were never labeled as such, because the proportion of married users is high in reality. More reliable negative examples can therefore be randomly selected, for training, from the data that differ most from the known positive data. The cosine similarity between sample features (such as interest preference distributions) can be used as the criterion.
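One way to sketch this more reliable negative selection is shown below. It is an illustration under stated assumptions: the "common features" of the positive users are represented by the mean (centroid) of their feature vectors, which the text does not specify, and the names `select_reliable_negatives` and `max_similarity` are hypothetical. A user counts as sufficiently different when the cosine similarity to the centroid falls below `max_similarity`, which is equivalent to the difference exceeding a preset threshold.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean, used here as the positives' common feature (assumption)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_reliable_negatives(positives, unlabeled, max_similarity, k, seed=0):
    """Randomly pick up to k unlabeled users that are dissimilar to the positives."""
    center = centroid(positives)
    dissimilar = [u for u in unlabeled
                  if cosine_similarity(u, center) < max_similarity]
    rng = random.Random(seed)
    return rng.sample(dissimilar, min(k, len(dissimilar)))


# Hypothetical interest-preference distributions.
positives = [[1.0, 0.0], [0.9, 0.1]]
unlabeled = [[1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
negatives = select_reliable_negatives(positives, unlabeled,
                                      max_similarity=0.5, k=2)
```

The first unlabeled user closely resembles the positives and is excluded from the negative pool, which is the point of the strategy: it may well be an unlabeled married user.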
The classification model for the first attribute may be a binary classifier that determines whether a user is married. The model is trained with the logistic regression (LR) machine learning algorithm, yielding an LR model.
Further, acquiring the at least one labeled user having the first attribute may include:
selecting, based on historical service data of the social network users, at least one user who has set the first attribute as a user to be processed;
classifying the user to be processed based on the classification model for the first attribute to obtain a classification result for the user to be processed;
and determining the probability that the first attribute of the user to be processed matches the corresponding classification result, and selecting the users to be processed whose probability is higher than a preset probability threshold as the labeled users.
The content set in the first attribute may be obtained from the user's tag. Among the users who have set the first attribute, the set content may take several values, which may include: married, unmarried, single, with children, newly married, in a relationship, engaged, separated, and divorced;
correspondingly, when determining the probability that the first attribute of the user to be processed matches the corresponding classification result, a category is first selected for the user to be processed according to the content set in that user's first attribute. For example, the contents that may correspond to the married category include: married, newly married, and with children; the contents corresponding to the unmarried category include: single, unmarried, in a relationship, engaged, separated, and divorced.
Preferably, in this embodiment, the labeled users are further calibrated after selection to ensure the quality of the training data. Specifically, after the users to be processed whose probability is higher than the preset probability threshold are selected as the labeled users, the method further includes:
acquiring, from at least one dimension, historical service data corresponding to the labeled users;
and screening the labeled users based on the historical service data of the at least one dimension to obtain the screened labeled users.
The at least one dimension may comprise at least one of: the frequency with which the user browses websites of a preset type; the type of user group the user joins; the type of target data the user operates on; and the content of a preset type of attribute of the user. The preset type of website may be a marriage and dating website; the user group may be a singles group, a mother-and-baby group, and the like; the operated target data may be a type of photo in an album.
For example, users who frequently browse dating websites cannot be in the non-single training set, users who are frequently active in the mother-and-baby group cannot be in the non-married-and-parenting training set, and users whose albums contain wedding photos cannot be in the non-newlywed-and-married training set.
The at least one second-type initial user is selected from all users other than the at least one first-type initial user. As shown in FIG. 2, the first-type initial users are regarded as positive examples (positive data), and a preset proportion of second-type initial users is randomly selected as negative examples (negative data) from the users remaining after the first-type initial users are removed, that is, from the unlabeled data. A classification model for the first attribute is then established and trained using the first-type and second-type initial users as training data.
Building on FIG. 2, the data acquisition process described above is explained with reference to FIG. 3. Specifically, the classification model scores every user in the social network who has filled in a relationship status, estimating the probability p(c|instance) that the user belongs to the married population, and the data meeting the following conditions are retained as the candidate training data set for multi-class classification:
p(c=0|instance,label=0)>threshold1
p(c=1|instance,label=1)>threshold2
where c is the category estimated by the classification model for the first attribute, that is, whether the user is judged to be married based on at least one second attribute of the user and the classification model; instance is the user to be processed; and label is the category labeled for the instance (for example, "married"). threshold denotes a cutoff threshold: threshold1 retains users predicted with high probability to be unmarried, and threshold2 retains users predicted with high probability to be married.
With further reference to FIG. 4, data calibration is performed: to further ensure the quality of the training data, rules are defined manually and the candidate training data set is corrected so that only users whose status is highly reliable are kept. For example, users who frequently browse marriage and dating websites cannot be in the non-single training set; users who are frequently active in the mother-and-baby group cannot be in the non-married-and-parenting training set; and users whose albums contain wedding photos cannot be in the non-newlywed-and-married training set. Users under 18 years of age can only be "in a relationship" or "single". In this way, a large labeled data set of users with relationship and marital statuses can be obtained for training the model.
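The manually defined calibration rules can be sketched as a set of consistency checks. Everything concrete here is an assumption for illustration: the dictionary keys, the English status strings, and in particular the reading of the "married and parenting" set as the statuses "married" and "with children" and of the "newlywed and married" set as "newly married" and "married".

```python
def violates_rules(user):
    """Return True if the user's behavior contradicts the self-reported status."""
    status = user["status"]
    # Frequent dating-site visitors are only trusted when labeled single.
    if user.get("browses_dating_sites") and status != "single":
        return True
    # Users active in mother-and-baby groups must be married or have children.
    if (user.get("active_in_mother_baby_group")
            and status not in {"married", "with children"}):
        return True
    # Users with wedding photos must be newly married or married.
    if (user.get("has_wedding_photos")
            and status not in {"newly married", "married"}):
        return True
    # Users under 18 can only be in a relationship or single.
    age = user.get("age")
    if age is not None and age < 18 and status not in {"in a relationship", "single"}:
        return True
    return False


def calibrate(users):
    """Drop candidate training users that break any rule."""
    return [u for u in users if not violates_rules(u)]


# Hypothetical candidate training users.
users = [
    {"id": "a", "status": "married", "age": 30, "browses_dating_sites": True},
    {"id": "b", "status": "single", "age": 25, "browses_dating_sites": True},
    {"id": "c", "status": "married", "age": 17},
    {"id": "d", "status": "newly married", "age": 28, "has_wedding_photos": True},
    {"id": "e", "status": "single", "age": 22, "active_in_mother_baby_group": True},
]
kept_ids = [u["id"] for u in calibrate(users)]
```

Keeping each rule as an independent check makes it straightforward to add or retire rules as new behavioral signals become available.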
With this scheme, at least one labeled user having the first attribute can be acquired based on historical service data, a classification model for the first attribute can be determined based on at least one feature parameter from at least one dimension and the first attribute of the labeled user, and at least one target user can be classified according to the classification model. This avoids the problem that target users cannot be classified accurately because they have not filled in the first attribute or the filled-in first attribute is out of date.
Embodiment 3
An embodiment of the present invention provides a user classification method. As shown in FIG. 1, the method includes:
step 101: acquiring at least one labeled user having a first attribute based on historical service data of social network users, wherein the first attribute represents the relationship and marital status of the social network user;
step 102: acquiring, from at least one dimension, at least one feature parameter corresponding to the labeled user, and determining a classification model for the first attribute based on the feature parameter of the labeled user and the first attribute corresponding to the labeled user;
step 103: classifying, based on the classification model for the first attribute, at least one target user in the social network into a corresponding category of the first attribute.
Here, the scheme provided by this embodiment may be applied on the server side.
The classification model for the first attribute takes the feature parameters of a user as input and outputs the category of the first attribute corresponding to that user.
Before step 101 is executed to acquire the at least one labeled user having the first attribute, the method further includes:
selecting, based on historical service data of the social network users, at least one first-type initial user whose first attribute is a first category; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute may be the marital status of the user; correspondingly, the first attribute may have two categories, where the first category may be married and the second category may be unmarried;
determining common features of the first-type initial users based on the historical service data of the first-type initial users;
selecting from the social network, based on the common features of the first-type initial users, at least one second-type initial user whose difference from those common features exceeds a preset threshold;
and establishing a classification model for the first attribute based on the historical service data of the first-type initial users and the second-type initial users.
Selecting the at least one first-type initial user whose first attribute is the first category may include: selecting, according to the users' historical service data, the users whose first attribute is set to the first category as the first-type initial users. The first category is married and, correspondingly, the first-type initial users are married users. These users are selected first because the relationship and marital status that social network users fill in at registration is assumed to be accurate; the only problem is that some statuses have not been updated for a long time.
The preset proportion can be set according to actual conditions; for example, 30% of the remaining users may be selected as second-type initial users, or alternatively 50%.
Based on the above operations, this embodiment further provides that selecting at least one second type of initial user from all users excluding the at least one first type of initial user includes:
determining common characteristics corresponding to the first type initial users based on historical service data of the first type initial users;
and selecting, from the social network and based on the common characteristics corresponding to the first type initial users, at least one second type initial user whose difference from those common characteristics exceeds a preset threshold value.
For the selection of negative examples, a random strategy may leave the Unlabeled data containing users who should be Positive but are not labeled as such, particularly because the proportion of married users is high in reality. More reliable negative examples can therefore be randomly selected for training from the data that differs more from the known Positive data. The cosine similarity between sample features (such as interest preference distribution) can be used as the criterion.
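The negative-example selection described above can be sketched as follows. This is only an illustrative sketch: the function names, the use of the mean positive vector as the "common characteristics", and the definition of the difference as one minus the cosine similarity are assumptions, not details given in this description.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean, used here as an assumed stand-in for the common characteristics."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def select_reliable_negatives(positives, unlabeled, min_difference, sample_size, seed=0):
    """Randomly pick negatives among unlabeled users that differ enough from the positives.

    positives: list of feature vectors of first type initial users.
    unlabeled: dict mapping a user id to that user's feature vector.
    min_difference: assumed threshold on (1 - cosine similarity).
    """
    centroid = mean_vector(positives)
    candidates = sorted(
        user_id
        for user_id, vector in unlabeled.items()
        if 1.0 - cosine_similarity(vector, centroid) > min_difference
    )
    rng = random.Random(seed)
    return rng.sample(candidates, min(sample_size, len(candidates)))
```

Users whose interest preference vectors point in nearly the same direction as those of the known married users are left out of the negative pool, which reduces the chance of training on unlabeled married users as negatives.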
The classification Model for the first attribute of the user may be a binary classifier, which is used to determine whether the user is married, and a Logistic Regression (LR) machine learning algorithm is used to train the Model, i.e., an LR Model.
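As a minimal sketch of such a binary LR model, the following trains a logistic regression by plain batch gradient descent. The function names, learning rate, and epoch count are illustrative assumptions; the description does not specify how the LR model is optimized.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_lr(features, labels, learning_rate=1.0, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    features: list of feature vectors; labels: 1 for married, 0 for unmarried.
    """
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_married_probability(weights, bias, x):
    """Estimated probability that the user with feature vector x is married."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
```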
Further, the acquiring of the at least one annotation user with the first attribute may include:
selecting at least one user with a first attribute as a user to be processed based on historical service data of the social network users;
classifying the user to be processed based on the classification model aiming at the first attribute of the user to obtain a classification result aiming at the user to be processed;
and determining the probability that the first attribute of the user to be processed is the same as the corresponding classification result of the user to be processed, and selecting the user to be processed with the probability higher than a preset probability threshold value as the labeled user.
The content set in the first attribute may be obtained based on a tag of the user. Among the at least one user with the first attribute, the content that a user sets for the first attribute may take a plurality of values, which may include: married, unmarried, single, with children, newly married, in love, engaged, separated, and divorced;
correspondingly, when determining the probability that the first attribute of the user to be processed is the same as the corresponding classification result, a corresponding category may first be selected for the user to be processed according to the content set in the first attribute of that user. For example, the contents set in the first attribute that may correspond to the married category include: married, newly married, and with children; the contents set in the first attribute that correspond to the unmarried category include: single, unmarried, in love, engaged, separated, divorced, and the like.
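The mapping from the content a user filled in to the married or unmarried category can be illustrated with a small lookup. The English status strings and the 1/0 encoding (1 for married, 0 for unmarried) are assumptions made for this sketch.

```python
# Assumed English renderings of the filled-in contents for each category.
MARRIED_CONTENTS = {"married", "newly married", "with children"}
UNMARRIED_CONTENTS = {"single", "unmarried", "in love", "engaged", "separated", "divorced"}


def category_from_content(content):
    """Return 1 (married), 0 (unmarried), or None when the content is not recognized."""
    normalized = content.strip().lower()
    if normalized in MARRIED_CONTENTS:
        return 1
    if normalized in UNMARRIED_CONTENTS:
        return 0
    return None
```

Returning `None` for unrecognized content keeps such users out of the labeled set instead of guessing a category for them.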
Preferably, in this embodiment, after the labeled users are selected, they are further calibrated in order to further ensure the quality of the training data. Specifically, after selecting the users to be processed whose probability is higher than the preset probability threshold value as the labeled users, the method further includes:
respectively acquiring historical service data corresponding to the labeled users from at least one dimension;
and screening the labeled users based on the historical service data of the at least one dimension to obtain the screened labeled users.
Wherein the at least one dimension may comprise at least one of: the frequency with which the user browses websites of a preset type; the type of user group the user joins; the type of target data operated on by the user; and content corresponding to a preset type of attribute of the user. The preset type of website can be a marriage-and-dating website; the user group can be a singles group, a mother-and-baby group, and the like; the target data operated on may be a type of photograph in an album.
Further, the user love and marriage status classifier focuses on user feature extraction and classification algorithm design, of which extracting effective features is the most important. Referring to fig. 5, the data source represents the data of the users on which feature extraction is to be performed; feature extraction may be performed along at least one dimension, the extracted features are expressed in normalized form, and features that do not overlap with one another are selected from the extracted features.
In this embodiment, a description is given of establishment, training, and adjustment of a classification model of a first attribute of a user, where the obtaining of at least one feature parameter corresponding to the labeled user from at least one dimension includes at least one of:
acquiring basic attribute parameters of an annotation user based on historical service data of the annotation user;
acquiring operation parameters of a labeling user for target data based on historical service data of the labeling user;
and acquiring interactive characteristic parameters determined by interactive data between the annotation user and other users except the annotation user based on historical service data of the annotation user.
As shown in fig. 6, the following categories can be included:
demographic attributes (Demographics): the basic attribute information of the user comprises age, gender, occupation, education degree, consumption habit, hometown, frequent residence and the like;
behavioral preferences (Behavioral): the commercial interest and the keyword Tag of the user, and the mining sources comprise groups, advertisement clicks, mobile App, webpage browsing and the like;
remarketing rule (Remarketing Rule): rule identification information generated from a package of user identification numbers uploaded by an advertiser; advertisement information can be associated according to the rule identification information.
Further, the above-mentioned at least one characteristic parameter is explained:
the basic attribute parameters of the labeling users comprise at least one of the following parameters: login location information, login time period, membership in a group with a preset name, and interaction frequency within the group;
the operation parameters of the labeling user for the target data at least comprise: an operation frequency and an operation period for preset types of target information;
the interaction characteristic parameters determined by the interaction data between the labeling user and other users except the labeling user comprise at least one of the following parameters: the gender attribute of the other users, the interaction frequency between the other users and the labeling user, and the login address information of the other users.
Correspondingly, based on the historical service data of the at least one dimension, the annotation users are screened to obtain screened annotation users, which may be at least one of the following:
the operation frequency and the operation time period for the preset type of target information meet the conditions of the preset frequency and the preset time period; for example, LBS behavior: younger users who are always active on campus are more likely to be single or in love; online time period: users who are online through the night are more likely to be unmarried; friend group name: whether a friend group with a specific name is included, and the interaction frequency;
the interactive characteristic parameters determined by the interactive data between the labeling user and other users except the labeling user meet preset conditions;
for example, the gender attribute of the other user is different from that of the labeling user, that is, a labeling user who frequently chats with a friend of the opposite sex is more likely to be a non-single user; of course, whether the labeling user and the other user both satisfy the preset condition can also be considered, that is, whether each is the only interaction partner of the other; whether the other users are friends whose names contain specific terms, and the interaction frequency with those friends, can also be judged;
judging based on the login behaviors of the labeling user and the other users, for example, whether a male friend and a female friend frequently log in from the same IP address, distinguishing in particular evenings, weekends, and holidays;
in addition, the love and marriage status of the other users can be acquired: friends who are in frequent contact are more likely to have consistent love and marriage statuses.
Judging whether the operating frequency of the target information of the preset type meets a frequency threshold value or not and whether the operating time interval meets the requirement of the preset time interval or not based on the operating frequency and the operating time interval of the target information of the preset type;
for example, album classification: whether a newlywed photo album has been uploaded recently;
alternatively, UGC updates: whether text about romance, newborns, or childcare has been published recently.
Referring to fig. 7, on the basis of fig. 5, one or more of the plurality of features extracted on the left side may be selected as the user features according to the feature configuration; the labeling data formed by the labeling users is then matched with the user features to obtain training data and test data; the training data and the test data can be selected according to actual conditions, for example, one out of every 4 records can be selected as test data and the remaining three as training data;
training the classification model based on training data, wherein the training can be performed by training the classification model by taking a plurality of characteristics of a user as input data and a known type corresponding to the user as a result;
testing the classification model based on the test data, wherein the testing can be performed by taking a plurality of characteristics of the user as input data, obtaining a corresponding output result based on the classification model, and determining the probability that the output result matches the known type of the user; when the probability is higher than a preset threshold value, the classification model is determined to be successfully established; otherwise, training continues.
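The data split and the test step above can be sketched as follows. Taking every fourth record as test data follows the example in the text; the function names and the use of the fraction of matching predictions as the match probability are assumptions for illustration.

```python
def split_every_fourth(samples):
    """Put one of every 4 samples into the test set and the other three into the training set."""
    train, test = [], []
    for index, sample in enumerate(samples):
        if index % 4 == 3:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


def match_rate(predict, test_samples):
    """Fraction of test samples whose predicted type equals the known type.

    predict: callable mapping a feature vector to a predicted type.
    test_samples: list of (features, known_type) pairs.
    """
    if not test_samples:
        return 0.0
    matches = sum(1 for features, known in test_samples if predict(features) == known)
    return matches / len(test_samples)


def model_is_established(predict, test_samples, min_match_rate):
    """True when the match rate exceeds the preset threshold; otherwise training should continue."""
    return match_rate(predict, test_samples) > min_match_rate
```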
For classification model building and training, two strategies are tried simultaneously: a single Softmax Regression multi-class classifier, and a plurality of One-vs-All Logistic Regression binary classifiers. The optimal classifier strategy and parameters are selected, and the model is learned, by tuning the training data scale, the ratio of positive to negative examples, the optimization algorithm, the regularization factor, and the like. Finally, classification estimation is performed on all users, and the love and marriage label with the maximum probability is selected for each user. To ensure accuracy, the maximum probability can be truncated at a set threshold value, so that accuracy and user coverage are balanced to achieve the best effect.
By adopting the scheme, at least one labeling user with a first attribute can be acquired based on historical service data, a classification model for the first attribute of the user can be determined based on at least one characteristic parameter of at least one dimension and the first attribute of the labeling user, and at least one target user can be classified according to the classification model. This avoids the problem that a target user cannot be accurately classified because the user has not filled in the first attribute or because the first attribute filled in by the user is out of date.
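The final labeling step, choosing the label with the maximum probability and truncating low-confidence results, can be sketched as below. The function names and the dictionary representation of per-label probabilities are assumptions; the probabilities could come from either of the two classifier strategies.

```python
def pick_label(probabilities, min_probability):
    """Return the label with the maximum probability, or None if it falls below the cutoff.

    probabilities: dict mapping a love and marriage label to its estimated probability.
    """
    label, probability = max(probabilities.items(), key=lambda item: item[1])
    return label if probability >= min_probability else None


def coverage(assignments):
    """Fraction of users that received a label after truncation."""
    if not assignments:
        return 0.0
    return sum(1 for label in assignments if label is not None) / len(assignments)
```

Raising `min_probability` tends to make the assigned labels more accurate while lowering coverage, which is the balance the paragraph above refers to.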
Example four,
An embodiment of the present invention provides a server, as shown in fig. 8, including:
the user obtaining unit 81 is configured to obtain at least one tagged user with a first attribute based on historical service data of a social network user; wherein the first attribute is used for representing the love and marriage state of the social network user;
the model establishing unit 82 is configured to acquire at least one feature parameter corresponding to the labeling user from at least one dimension, and determine a classification model for a first attribute of the user based on the feature parameter of the labeling user and the first attribute corresponding to the labeling user;
the classification unit 83 is configured to classify, for at least one target user in the social network, a category of the corresponding first attribute based on the classification model for the first attribute of the user.
Here, the scheme provided by the present embodiment may be applied to the server side.
The classification model for the first attribute of the user takes the characteristic parameter of the user as an input parameter and takes the category of the first attribute corresponding to the user as an output parameter.
The user obtaining unit 81 is configured to select at least one first-class initial user with a first attribute being a first class based on historical service data of the social network user; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute may be a marital status of the user; correspondingly, the categories corresponding to the first attribute can be two, the first category can be married, and the second category can be unmarried; determining common characteristics corresponding to the first type initial users based on historical service data of the first type initial users; selecting at least one second type initial user with the difference value of the common characteristics of the first type initial users exceeding a preset threshold value from the social network based on the common characteristics corresponding to the first type initial users, and setting the first attribute of the second type initial user as a second type; and establishing a classification model aiming at the first attribute of the user based on the historical service data of the first class of initial users and the second class of initial users.
Selecting at least one first type initial user whose first attribute is the first category may include: selecting, according to the historical service data of the users, the users who have filled in the first attribute as the first category as the first type initial users. The first category is married and, correspondingly, the first type initial users are married users. The first type initial users are selected in this way because it is assumed that the love and marriage status filled in by social network users at registration is accurate, and that the only problem is that some statuses have not been updated for a long time.
Selecting at least one second type initial user from all users except the at least one first type initial user, as shown in fig. 2, means the following: the at least one first type initial user is regarded as Positive examples (Positive data); a preset proportion of second type initial users is randomly selected as Negative examples (Negative data) from all users remaining after the first type initial users are excluded, that is, from the Unlabeled data; and a classification model for the first attribute of the user is established and trained using the first type initial users and the second type initial users as training data.
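A minimal sketch of this random selection is given below. The function name, the fixed seed, and rounding the sample count to the nearest integer are illustrative assumptions.

```python
import random


def sample_negatives(all_user_ids, positive_ids, ratio, seed=0):
    """Randomly pick a preset proportion of the users left after excluding the positives."""
    positives = set(positive_ids)
    remaining = sorted(uid for uid in all_user_ids if uid not in positives)
    count = round(len(remaining) * ratio)
    return random.Random(seed).sample(remaining, count)
```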
The preset proportion can be set according to actual conditions, for example, 30% of users can be selected from the rest of users as second-class initial users; alternatively, 50% of the users may be selected as the second type of initial users.
The classification Model for the first attribute of the user may be a binary classifier, which is used to determine whether the user is married, and a Logistic Regression (LR) machine learning algorithm is used to train the Model, i.e., an LR Model.
Further, the user obtaining unit 81 is configured to select, based on the historical service data of the social network user, at least one user with a first attribute as a user to be processed; classifying the user to be processed based on the classification model aiming at the first attribute of the user to obtain a classification result aiming at the user to be processed; and determining the probability that the first attribute of the user to be processed is the same as the corresponding classification result of the user to be processed, and selecting the user to be processed with the probability higher than a preset probability threshold value as the labeled user.
The content set in the first attribute may be obtained based on a tag of the user. Among the at least one user with the first attribute, the content that a user sets for the first attribute may take a plurality of values, which may include: married, unmarried, single, with children, newly married, in love, engaged, separated, and divorced;
correspondingly, when determining the probability that the first attribute of the user to be processed is the same as the corresponding classification result, a corresponding category may first be selected for the user to be processed according to the content set in the first attribute of that user. For example, the contents set in the first attribute that may correspond to the married category include: married, newly married, and with children; the contents set in the first attribute that correspond to the unmarried category include: single, unmarried, in love, engaged, separated, divorced, and the like.
On the basis of fig. 2, the Data Acquisition process described above is described with reference to fig. 3, specifically: classification estimation is performed on all users in the social network who have filled in a love and marriage status, to judge whether each user belongs to the married population, with probability p(c|instance), and data meeting the following conditions is retained as the multi-class candidate training data set:
p(c=0|instance,label=0)>threshold1
p(c=1|instance,label=1)>threshold2
wherein c is the category estimated by the classification model for the first attribute of the user, that is, whether the user is married as judged based on at least one second attribute of the user and the classification model; instance is the user to be processed, and label is the category of the label filled in for the instance, that is, whether it is "married". threshold represents a cutoff threshold: threshold1 is used to retain the high-probability population predicted to be unmarried, and threshold2 is used to retain the high-probability population predicted to be married.
By adopting the scheme, at least one labeling user with a first attribute can be acquired based on historical service data, a classification model for the first attribute of the user can be determined based on at least one characteristic parameter of at least one dimension and the first attribute of the labeling user, and at least one target user can be classified according to the classification model. This avoids the problem that a target user cannot be accurately classified because the user has not filled in the first attribute or because the first attribute filled in by the user is out of date.
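The two retention conditions can be illustrated with the following sketch, which assumes that category 1 means married and category 0 means unmarried, and that the binary classifier outputs the probability of being married. The function name and the record layout are hypothetical.

```python
def filter_candidates(records, threshold1, threshold2):
    """Keep users whose filled-in label is confirmed by the classifier with high probability.

    records: iterable of (user_id, label, p_married), where label is 1 (married) or
    0 (unmarried) and p_married is the estimated probability that c = 1.
    threshold1 applies to users labeled unmarried, threshold2 to users labeled married.
    """
    kept = []
    for user_id, label, p_married in records:
        if label == 0 and (1.0 - p_married) > threshold1:
            kept.append((user_id, label))
        elif label == 1 and p_married > threshold2:
            kept.append((user_id, label))
    return kept
```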
Example five,
An embodiment of the present invention provides a server, as shown in fig. 8, including:
the user obtaining unit 81 is configured to obtain at least one tagged user with a first attribute based on historical service data of a social network user; wherein the first attribute is used for representing the love and marriage state of the social network user;
the model establishing unit 82 is configured to acquire at least one feature parameter corresponding to the labeling user from at least one dimension, and determine a classification model for a first attribute of the user based on the feature parameter of the labeling user and the first attribute corresponding to the labeling user;
the classification unit 83 is configured to classify, for at least one target user in the social network, a category of the corresponding first attribute based on the classification model for the first attribute of the user.
The classification model for the first attribute of the user takes the characteristic parameter of the user as an input parameter and takes the category of the first attribute corresponding to the user as an output parameter.
The user obtaining unit 81 is configured to select at least one first-class initial user with a first attribute being a first class based on historical service data of the social network user; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute may be a marital status of the user; correspondingly, the categories corresponding to the first attribute can be two, the first category can be married, and the second category can be unmarried; determining common characteristics corresponding to the first type initial users based on historical service data of the first type initial users; selecting at least one second type initial user with the difference value of the common characteristics of the first type initial users exceeding a preset threshold value from the social network based on the common characteristics corresponding to the first type initial users; and establishing a classification model aiming at the first attribute of the user based on the historical service data of the first class of initial users and the second class of initial users.
Based on the above operations, this embodiment further provides that the at least one second type initial user is selected from all users excluding the at least one first type initial user, and the user obtaining unit 81 is configured to determine, based on historical service data of the first type initial user, a common feature corresponding to the first type initial user; and selecting at least one second type initial user with the difference value of the common characteristics of the first type initial users exceeding a preset threshold value from the social network based on the common characteristics corresponding to the first type initial users.
For the selection of negative examples, a random strategy may leave the Unlabeled data containing users who should be Positive but are not labeled as such, particularly because the proportion of married users is high in reality. More reliable negative examples can therefore be randomly selected for training from the data that differs more from the known Positive data. The cosine similarity between sample features (such as interest preference distribution) can be used as the criterion.
The classification Model for the first attribute of the user may be a binary classifier, which is used to determine whether the user is married, and a Logistic Regression (LR) machine learning algorithm is used to train the Model, i.e., an LR Model.
Further, the user obtaining unit 81 is configured to select, based on the historical service data of the social network user, at least one user with a first attribute as a user to be processed; classifying the user to be processed based on the classification model aiming at the first attribute of the user to obtain a classification result aiming at the user to be processed; and determining the probability that the first attribute of the user to be processed is the same as the corresponding classification result of the user to be processed, and selecting the user to be processed with the probability higher than a preset probability threshold value as the labeled user.
The content set in the first attribute may be obtained based on a tag of the user. Among the at least one user with the first attribute, the content that a user sets for the first attribute may take a plurality of values, which may include: married, unmarried, single, with children, newly married, in love, engaged, separated, and divorced;
correspondingly, when determining the probability that the first attribute of the user to be processed is the same as the corresponding classification result, a corresponding category may first be selected for the user to be processed according to the content set in the first attribute of that user. For example, the contents set in the first attribute that may correspond to the married category include: married, newly married, and with children; the contents set in the first attribute that correspond to the unmarried category include: single, unmarried, in love, engaged, separated, divorced, and the like.
Preferably, after the labeled users are selected, they are further calibrated in order to further ensure the quality of the training data. Specifically, after selecting the users to be processed whose probability is higher than a preset probability threshold value as the labeled users, the user obtaining unit 81 is configured to obtain the historical service data corresponding to the labeled users from at least one dimension; and to screen the labeled users based on the historical service data of the at least one dimension to obtain the screened labeled users.
Wherein the at least one dimension may comprise at least one of: the frequency with which the user browses websites of a preset type; the type of user group the user joins; the type of target data operated on by the user; and content corresponding to a preset type of attribute of the user. The preset type of website can be a marriage-and-dating website; the user group can be a singles group, a mother-and-baby group, and the like; the target data operated on may be a type of photograph in an album.
For example, users who frequently browse dating websites cannot be in the non-single training set, users who are frequently active in mother-and-baby groups cannot be in the non-"married & with children" training set, and users whose albums contain wedding photos cannot appear in the non-"newly married & married" training set.
Selecting at least one second type initial user from all users except the at least one first type initial user, as shown in fig. 2, means the following: the at least one first type initial user is regarded as Positive examples (Positive data); a preset proportion of second type initial users is randomly selected as Negative examples (Negative data) from all users remaining after the first type initial users are excluded, that is, from the Unlabeled data; and a classification model for the first attribute of the user is established and trained using the first type initial users and the second type initial users as training data.
On the basis of fig. 2, the Data Acquisition process described above is described with reference to fig. 3, specifically: classification estimation is performed on all users in the social network who have filled in a love and marriage status, to judge whether each user belongs to the married population, with probability p(c|instance), and data meeting the following conditions is retained as the multi-class candidate training data set:
p(c=0|instance,label=0)>threshold1
p(c=1|instance,label=1)>threshold2
wherein c is the category estimated by the classification model for the first attribute of the user, that is, whether the user is married as judged based on at least one second attribute of the user and the classification model; instance is the user to be processed, and label is the category of the label filled in for the instance, that is, whether it is "married". threshold represents a cutoff threshold: threshold1 is used to retain the high-probability population predicted to be unmarried, and threshold2 is used to retain the high-probability population predicted to be married.
With further reference to fig. 4, Data Calibration: in order to further ensure the quality of the training data, rules are manually defined and the candidate training data set is corrected as follows: users whose status is highly reliable are collected for each status; for example, users who frequently browse marriage-and-dating websites cannot be in the non-single training set, users who are frequently active in mother-and-baby groups cannot be in the non-"married & with children" training set, users whose albums contain wedding photos cannot be in the non-"newly married & married" training set, and the like. Users under 18 years of age may only be "in love" or "single". In this way, a large number of user labeling data sets with love and marriage statuses can be obtained for training the model.
By adopting the scheme, at least one labeling user with a first attribute can be acquired based on historical service data, a classification model for the first attribute of the user can be determined based on at least one characteristic parameter of at least one dimension and the first attribute of the labeling user, and at least one target user can be classified according to the classification model. This avoids the problem that a target user cannot be accurately classified because the user has not filled in the first attribute or because the first attribute filled in by the user is out of date.
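The manually defined calibration rules can be sketched as simple filters over the candidate training data. The dictionary keys, the boolean behavior flags, and the English label strings are hypothetical names introduced only for this illustration.

```python
def passes_calibration(user):
    """Check one candidate against the manually defined rules.

    user: dict with a 'label' key plus optional hypothetical keys 'age',
    'browses_dating_sites', 'active_in_mother_baby_group', and 'has_wedding_album'.
    """
    label = user["label"]
    # Frequent visitors of marriage-and-dating websites may only be used as "single".
    if user.get("browses_dating_sites") and label != "single":
        return False
    # Users active in mother-and-baby groups may only be married or with children.
    if user.get("active_in_mother_baby_group") and label not in {"married", "with children"}:
        return False
    # Users with wedding photo albums may only be newly married or married.
    if user.get("has_wedding_album") and label not in {"newly married", "married"}:
        return False
    # Users under 18 may only be "in love" or "single".
    age = user.get("age")
    if age is not None and age < 18 and label not in {"in love", "single"}:
        return False
    return True


def calibrate(candidates):
    """Return the candidates that are consistent with every rule."""
    return [user for user in candidates if passes_calibration(user)]
```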
Example six,
An embodiment of the present invention provides a server, as shown in fig. 8, including:
the user obtaining unit 81 is configured to obtain at least one tagged user with a first attribute based on historical service data of a social network user; wherein the first attribute is used for representing the love and marriage state of the social network user;
the model establishing unit 82 is configured to acquire at least one feature parameter corresponding to the labeling user from at least one dimension, and determine a classification model for a first attribute of the user based on the feature parameter of the labeling user and the first attribute corresponding to the labeling user;
the classification unit 83 is configured to classify, for at least one target user in the social network, a category of the corresponding first attribute based on the classification model for the first attribute of the user.
The classification model for the first attribute of the user takes the characteristic parameter of the user as an input parameter and takes the category of the first attribute corresponding to the user as an output parameter.
The user obtaining unit 81 is configured to select at least one first-class initial user with a first attribute being a first class based on historical service data of the social network user; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute may be a marital status of the user; correspondingly, the categories corresponding to the first attribute can be two, the first category can be married, and the second category can be unmarried; determining common characteristics corresponding to the first type initial users based on historical service data of the first type initial users; selecting at least one second type initial user with the difference value of the common characteristics of the first type initial users exceeding a preset threshold value from the social network based on the common characteristics corresponding to the first type initial users; and establishing a classification model aiming at the first attribute of the user based on the historical service data of the first class of initial users and the second class of initial users.
Selecting at least one first type initial user whose first attribute is the first category may include: selecting, according to the historical service data of the users, the users who have filled in the first attribute as the first category as the first type initial users. The first category is married and, correspondingly, the first type initial users are married users. The first type initial users are selected in this way because it is assumed that the love and marriage status filled in by social network users at registration is accurate, and that the only problem is that some statuses have not been updated for a long time.
The preset proportion can be set according to actual conditions, for example, 30% of users can be selected from the rest of users as second-class initial users; alternatively, 50% of the users may be selected as the second type of initial users.
The user obtaining unit 81 is configured to determine, based on historical service data of the first-type initial users, the common features corresponding to the first-type initial users, and to select from the social network, based on those common features, at least one second-type initial user whose difference from the common features of the first-type initial users exceeds a preset threshold.
When negative examples are selected with a purely random strategy, the Unlabeled data may contain users who are in fact Positive but were never marked as such, and because the proportion of married users is high in practice, this risk is significant. More reliable negative examples can therefore be drawn at random from the Unlabeled data that differs most from the known Positive data and used for training. The cosine similarity between sample features (such as the interest-preference distribution) can be used as the criterion.
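The negative-example selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of the positive centroid as the "common feature", the `max_similarity` cut-off, and the toy interest vectors are all assumptions made for the example.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean, used here as the common feature of the positives."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pick_negatives(positives, unlabeled, max_similarity, sample_size, seed=0):
    """Randomly sample negatives among unlabeled users dissimilar to the positives."""
    center = centroid(list(positives.values()))
    candidates = sorted(
        uid for uid, vec in unlabeled.items()
        if cosine_similarity(vec, center) < max_similarity
    )
    rng = random.Random(seed)
    return rng.sample(candidates, min(sample_size, len(candidates)))


# Hypothetical interest-preference distributions (assumed data).
positives = {"u1": [0.9, 0.1, 0.0], "u2": [0.8, 0.2, 0.0]}
unlabeled = {"u3": [0.85, 0.15, 0.0], "u4": [0.0, 0.1, 0.9], "u5": [0.1, 0.0, 0.8]}
negatives = pick_negatives(positives, unlabeled, max_similarity=0.5, sample_size=2)
```

Here user u3 closely resembles the known positives and is therefore never drawn as a negative example, while u4 and u5 remain eligible.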
The classification model for the first attribute of the user may be a binary classifier that determines whether a user is married. The model is trained with the Logistic Regression (LR) machine learning algorithm, yielding an LR model.
Further, the user obtaining unit 81 is configured to: select, based on the historical service data of social network users, at least one user who has set the first attribute as a user to be processed; classify the users to be processed with the classification model for the first attribute of the user to obtain a classification result for each user to be processed; determine the probability that the first attribute of each user to be processed is the same as the corresponding classification result; and select the users to be processed whose probability is higher than a preset probability threshold as the labeled users.
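As a rough illustration of such a binary LR model, the sketch below fits a logistic regression by batch gradient descent in plain Python. The two features, the learning rate, the number of epochs, and the toy data are assumptions chosen for the example; the patent does not specify them.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_lr(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that the user belongs to the married category."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical features: [wedding_site_visits, late_night_activity], scaled to 0..1.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.1, 0.9], [0.2, 0.8], [0.1, 0.7]]
y = [1, 1, 1, 0, 0, 0]  # 1 = married, 0 = unmarried
w, b = train_lr(X, y)
```

In practice a library implementation with regularization would normally be used; the hand-written loop is only meant to make the model explicit.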
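The selection of labeled users can be sketched as below: a user is kept only when the model assigns a sufficiently high probability to the category the user declared. The function names, the `threshold` value, and the stand-in `toy_model` are assumptions for illustration only.

```python
def select_labeled_users(candidates, predict_married_proba, threshold=0.8):
    """Keep users whose declared category the model confirms with high probability.

    candidates maps user id -> (declared_category, feature_vector), where
    declared_category is "married" or "unmarried".
    """
    labeled = {}
    for uid, (declared, features) in candidates.items():
        p_married = predict_married_proba(features)
        # Probability that the declared category matches the classification.
        p_declared = p_married if declared == "married" else 1.0 - p_married
        if p_declared > threshold:
            labeled[uid] = declared
    return labeled


def toy_model(features):
    """Stand-in for the binary classifier: returns P(married) directly."""
    return features[0]


candidates = {
    "a": ("married", [0.95]),
    "b": ("married", [0.30]),    # declared married, model disagrees
    "c": ("unmarried", [0.10]),
    "d": ("unmarried", [0.60]),  # declared unmarried, model is unsure
}
labeled = select_labeled_users(candidates, toy_model, threshold=0.8)
```

Users b and d are dropped because the model does not confirm their declared status with enough confidence, which is how stale profile entries are filtered out of the training labels.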
The content set in the first attribute may be obtained from the user's tags. Among the users who have set the first attribute, the content that was set may take many values, for example: married, unmarried, single, with children, newly married, in love, engaged, separated, and divorced.
Correspondingly, when determining the probability that the first attribute of a user to be processed is the same as the corresponding classification result, a category is first selected for the user according to the content set in the user's first attribute. For example, the content corresponding to the married category includes: married, newly married, and with children; the content corresponding to the unmarried category includes: single, unmarried, in love, engaged, separated, and divorced, among others.
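The mapping from profile content to the two categories can be expressed as a simple lookup. The exact status strings below are illustrative English renderings and are assumptions; a real system would use whatever values the profile field actually stores.

```python
# Hypothetical status strings grouped by category (assumed values).
MARRIED_CONTENTS = {"married", "newly married", "with children"}
UNMARRIED_CONTENTS = {
    "single", "unmarried", "in love", "engaged", "separated", "divorced",
}


def category_for(content):
    """Map the content a user set in the first attribute to a category.

    Returns "married", "unmarried", or None when the content is unknown.
    """
    key = content.strip().lower()
    if key in MARRIED_CONTENTS:
        return "married"
    if key in UNMARRIED_CONTENTS:
        return "unmarried"
    return None
```

Returning `None` for unrecognized content lets the caller skip such users rather than force them into a category.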
Preferably, to further ensure the quality of the training data, the labeled users are calibrated again after they are selected. Specifically, after the users to be processed whose probability is higher than the preset probability threshold are selected as the labeled users, the user obtaining unit 81 is configured to obtain the historical service data corresponding to the labeled users from at least one dimension, and to screen the labeled users based on the historical service data of the at least one dimension to obtain the screened labeled users.
The at least one dimension may comprise at least one of: the frequency with which the user browses websites of a preset type; the type of user group the user joins; the type of target data the user operates on; and content corresponding to an attribute of a preset type of the user. The preset type of website can be a wedding website; the user group can be a singles group, a mother-and-baby group, and the like; the target data operated on may be a type of photograph in an album.
Further, the user marital-status classifier centers on user feature extraction and classification algorithm design, of which extracting effective features is the most important. Referring to fig. 5, the data source represents the data of the users from whom features are to be extracted; features may be extracted along at least one dimension and expressed in a normally distributed form, and features that do not overlap one another are then selected from the extracted features.
This embodiment describes the establishment, training, and adjustment of the classification model for the first attribute of the user. Obtaining at least one feature parameter corresponding to the labeled user from at least one dimension includes at least one of:
acquiring basic attribute parameters of the labeled user based on historical service data of the labeled user;
acquiring operation parameters of the labeled user for target data based on historical service data of the labeled user;
and acquiring interaction feature parameters, determined from interaction data between the labeled user and users other than the labeled user, based on historical service data of the labeled user.
As shown in fig. 6, the following categories can be included:
demographic attributes (Demographics): the user's basic attribute information, including age, gender, occupation, education level, consumption habits, hometown, usual place of residence, and the like;
behavioral preferences (Behavioral): the user's commercial interests and keyword tags, mined from sources including groups, advertisement clicks, mobile apps, web browsing, and the like;
remarketing rule (Remarketing Rule): rule identification information generated from the packet of user identification numbers uploaded by an advertiser; advertisement information can be associated according to this rule identification information.
Further, the above-mentioned at least one feature parameter is explained as follows:
the basic attribute parameters of the labeled user comprise at least one of the following: login location information, login time period, membership of a group with a preset name, and interaction frequency within that group;
the operation parameters of the labeled user for the target data comprise at least: an operation frequency and an operation time period for a preset type of target information;
the interaction feature parameters determined from the interaction data between the labeled user and users other than the labeled user comprise at least one of the following: the gender attribute of the other users, the interaction frequency between the other users and the labeled user, and the login address information of the other users.
Correspondingly, screening the labeled users based on the historical service data of the at least one dimension to obtain the screened labeled users may use at least one of the following criteria:
whether the operation frequency and operation time period for the preset type of target information satisfy the preset frequency and the preset time period; for example, LBS behavior: younger users who are consistently active on a campus are more likely to be single or in a relationship; online time period: users who stay online through the night are more likely to be unmarried; friend group name: whether the user has a friend group with a specific name, and the interaction frequency within it;
whether the interaction feature parameters determined from the interaction data between the labeled user and users other than the labeled user satisfy preset conditions;
for example, the gender attribute of the other user differs from that of the labeled user, that is, a labeled user who frequently chats with a friend of the opposite sex is more likely not to be single; it may further be considered whether the labeled user and the other user both satisfy the preset condition, that is, whether each is the other's only such interaction partner; it may also be determined whether the other user is a friend whose name contains a specific term, and how frequently the two interact;
a determination may also be made from the login behavior of the labeled user and the other users, for example whether a male friend and a female friend frequently log in from the same IP address, distinguishing in particular evenings, weekends, and holidays;
in addition, the marital status of the other users can be obtained: friends who are in frequent contact are more likely to have the same marital status.
Based on the operation frequency and operation time period for the preset type of target information, it is determined whether the operation frequency meets a frequency threshold and whether the operation time period meets the preset time period;
for example, album classification: whether a wedding photo album has been uploaded recently;
or UGC posts: whether text about a partner, a newborn, or childcare has been published recently.
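The screening criteria above can be thought of as consistency rules applied to each labeled user. The sketch below shows two such rules; the `Activity` fields and the specific rules are hypothetical simplifications chosen for illustration and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Hypothetical per-user signals derived from historical service data."""
    uploaded_wedding_album: bool
    in_singles_group: bool


def passes_screening(label, activity):
    """Return False when the behavior contradicts the assigned label."""
    if label == "unmarried" and activity.uploaded_wedding_album:
        return False
    if label == "married" and activity.in_singles_group:
        return False
    return True


def screen(labeled, activities):
    """Keep only labeled users whose behavior is consistent with their label."""
    return {
        uid: label
        for uid, label in labeled.items()
        if passes_screening(label, activities[uid])
    }


labeled = {"a": "married", "b": "unmarried", "c": "married"}
activities = {
    "a": Activity(uploaded_wedding_album=True, in_singles_group=False),
    "b": Activity(uploaded_wedding_album=True, in_singles_group=False),
    "c": Activity(uploaded_wedding_album=False, in_singles_group=True),
}
screened = screen(labeled, activities)
```

Only user a survives: b is labeled unmarried but recently uploaded a wedding album, and c is labeled married but belongs to a singles group.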
Referring to fig. 7, which builds on fig. 5, one or more of the features extracted on the left side are selected as the user features according to the feature configuration. The labeled data formed from the labeled users is then matched with the user features to obtain training data and test data. The training data and the test data can be chosen according to actual conditions; for example, one of every four records can be used as test data and the remaining three as training data;
the classification model is trained on the training data, taking a plurality of features of each user as the input and the known category of that user as the expected result;
the classification model is then evaluated on the test data: a plurality of features of each user are taken as input, the corresponding output is obtained from the classification model, and the probability that the output matches the user's category is determined; when this probability is higher than a preset threshold, the classification model is determined to be successfully established; otherwise, training continues.
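The split and the acceptance check described above can be sketched as follows. The choice of which record in each group of four becomes test data, the function names, and the `min_accuracy` value are assumptions for illustration.

```python
def split_every_fourth(records):
    """Use every fourth record as test data and the other three as training data."""
    train = [r for i, r in enumerate(records) if i % 4 != 3]
    test = [r for i, r in enumerate(records) if i % 4 == 3]
    return train, test


def accuracy(model, test):
    """Fraction of test records whose predicted category matches the known one."""
    if not test:
        return 0.0
    hits = sum(1 for features, label in test if model(features) == label)
    return hits / len(test)


def is_model_accepted(model, test, min_accuracy):
    """The model counts as established only above the preset threshold."""
    return accuracy(model, test) > min_accuracy


def toy_model(features):
    """Stand-in classifier used only to exercise the evaluation code."""
    return "married" if features[0] >= 0.5 else "unmarried"


# Hypothetical labeled records: (feature_vector, known_category).
records = [([i / 8.0], "married" if i >= 4 else "unmarried") for i in range(8)]
train, test = split_every_fourth(records)
accepted = is_model_accepted(toy_model, test, min_accuracy=0.9)
```

If `accepted` were false, the training step would be repeated with adjusted data or parameters, as the text describes.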
For building and training the classification model, two strategies are tried in parallel: a single Softmax Regression multi-class classifier, and multiple One-vs-All Logistic Regression binary classifiers. The best classifier strategy and parameters are selected, and the model is learned, by tuning the training data scale, the ratio of positive to negative examples, the optimization algorithm, the regularization factor, and so on. Finally, classification is performed for all users, and the marital-status label with the maximum probability is selected for each user. To ensure accuracy, a threshold can be applied to the maximum probability so that low-confidence predictions are discarded, balancing accuracy against user coverage to achieve the best overall effect.
With this scheme, at least one labeled user with a first attribute can be acquired based on historical service data, a classification model for the first attribute of the user is determined based on at least one feature parameter from at least one dimension and the first attribute of the labeled user, and at least one target user is classified according to the classification model. This avoids the problem that a target user cannot be classified accurately because the user has not filled in the first attribute or because the first attribute the user filled in is out of date.
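The final labeling step, taking the maximum-probability label and discarding predictions below a threshold, can be sketched as below. The label set, the raw scores, and the `min_probability` value are assumptions; the source does not give concrete values.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign_label(scores, labels, min_probability):
    """Return (label, probability), or (None, probability) if below the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_probability:
        return None, probs[best]
    return labels[best], probs[best]


def coverage(assignments):
    """Share of users who received a label after truncation."""
    if not assignments:
        return 0.0
    return sum(1 for label, _ in assignments if label is not None) / len(assignments)


# Hypothetical label set and per-user raw scores.
labels = ["single", "in love", "married"]
confident = assign_label([0.1, 0.2, 3.0], labels, min_probability=0.7)
uncertain = assign_label([1.0, 1.0, 1.0], labels, min_probability=0.7)
```

Raising `min_probability` improves accuracy among labeled users but lowers `coverage`, which is the trade-off the text refers to.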
If the integrated module according to the embodiment of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a base station, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Based on the above device embodiment, this embodiment provides specific hardware. As shown in fig. 9, the apparatus includes a processor 92, a storage medium 94, and at least one external communication interface 91; the processor 92, the storage medium 94, and the external communication interface 91 are all connected by a bus 93. The processor 92 may be a microprocessor, a central processing unit, a digital signal processor, a programmable logic array, or another electronic component with processing functions. The storage medium 94 stores computer executable code.
The hardware may be the server. The processor, when executing the computer executable code, is capable of at least: acquiring at least one labeled user with a first attribute based on historical service data of social network users; acquiring at least one characteristic parameter corresponding to the labeling user from at least one dimension, and determining a classification model aiming at the first attribute of the user based on the characteristic parameter of the labeling user and the first attribute corresponding to the labeling user; and based on the classification model aiming at the first attribute of the user, classifying at least one target user in the social network into a corresponding category of the first attribute.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (12)
1. A method for classifying a user, the method comprising:
selecting at least one first-class initial user with a first attribute as a first class based on historical service data of users in the social network; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute is used for representing the love and marriage state of the user in the social network;
selecting at least one second type initial user from all users except the at least one first type initial user;
establishing a binary classification model aiming at a first attribute of a user based on the first class of initial users and the second class of initial users;
selecting at least one user with the first attribute as a user to be processed based on historical service data of the users in the social network;
classifying the users to be processed through the binary classification model to obtain a classification result aiming at the users to be processed;
selecting the user to be processed with the probability higher than a preset probability threshold value as a labeling user according to the probability that the first attribute of the user to be processed is the same as the corresponding classification result;
performing feature extraction processing on the historical service data of the labeled user to obtain a feature parameter corresponding to the labeled user, and training a classification model aiming at a first attribute of the user on the basis of the feature parameter of the labeled user and the first attribute corresponding to the labeled user;
processing the characteristic parameters of at least one target user in the social network through the classification model aiming at the first attribute of the user to obtain the category of the first attribute corresponding to the target user;
and according to the category of the first attribute corresponding to the target user, carrying out classified sending on the media information of the target user.
2. The method according to claim 1, wherein said selecting at least one second type of initial user from all users excluding said at least one first type of initial user comprises:
determining common characteristics corresponding to the first type initial users based on historical service data of the first type initial users;
and selecting at least one second type initial user with the difference value of the common characteristics of the first type initial users exceeding a preset threshold value from the social network based on the common characteristics corresponding to the first type initial users.
3. The method according to claim 1, wherein after the user to be processed with the probability higher than the preset probability threshold is selected as the labeled user, the method further comprises:
respectively acquiring historical service data corresponding to the labeled users from at least one dimension;
and screening the labeled users based on the historical service data of the at least one dimension to obtain the screened labeled users.
4. The method according to claim 1, wherein the characteristic parameters of the labeling user comprise basic attribute parameters, operation parameters and interaction characteristic parameters of the labeling user; and the interactive characteristic parameters are determined according to interactive data between the labeling users and other users except the labeling users.
5. The method of claim 4, wherein the basic attribute parameters of the labeling user comprise at least one of: login location information, login time period, membership of a group with a preset name, and interaction frequency of the group;
the operation parameters of the labeling user at least comprise: an operation frequency and an operation period for preset types of target information;
the interaction characteristic parameters of the labeling user comprise at least one of the following parameters: the gender attribute of the other users, the interaction frequency between the other users and the labeling user, and the login address information of the other users.
6. A server, comprising:
a user acquisition unit configured to:
selecting at least one first-class initial user with a first attribute as a first class based on historical service data of users in the social network; the first attribute comprises a first category and a second category, and the first category is different from the second category; the first attribute is used for representing the love and marriage state of the user in the social network;
selecting at least one second type initial user from all users except the at least one first type initial user;
establishing a binary classification model aiming at a first attribute of a user based on the first class of initial users and the second class of initial users;
selecting at least one user with the first attribute as a user to be processed based on historical service data of the users in the social network;
classifying the users to be processed through the binary classification model to obtain a classification result aiming at the users to be processed;
selecting the user to be processed with the probability higher than a preset probability threshold value as a labeling user according to the probability that the first attribute of the user to be processed is the same as the corresponding classification result;
the model establishing unit is used for carrying out feature extraction processing on the historical service data of the labeled user to obtain a feature parameter corresponding to the labeled user, and training a classification model aiming at the first attribute of the user based on the feature parameter of the labeled user and the first attribute corresponding to the labeled user;
the classification unit is used for processing the characteristic parameters of at least one target user in the social network through the classification model aiming at the first attribute of the user to obtain the category of the first attribute corresponding to the target user;
the classification unit is further configured to perform classification sending of media information on the target user according to the category of the first attribute corresponding to the target user.
7. The server according to claim 6,
the user obtaining unit is further configured to determine, based on historical service data of the first type of initial user, a common feature corresponding to the first type of initial user; and selecting at least one second type initial user with the difference value of the common characteristics of the first type initial users exceeding a preset threshold value from the social network based on the common characteristics corresponding to the first type initial users.
8. The server according to claim 6,
the user acquisition unit is further used for acquiring historical service data corresponding to the labeled user from at least one dimension; and screening the labeled users based on the historical service data of the at least one dimension to obtain the screened labeled users.
9. The server according to claim 6, wherein the characteristic parameters of the labeling user comprise basic attribute parameters, operation parameters and interaction characteristic parameters of the labeling user; and the interactive characteristic parameters are determined according to interactive data between the labeling users and other users except the labeling users.
10. The server according to claim 9, wherein the basic attribute parameters of the labeling user comprise at least one of: login location information, login time period, membership of a group with a preset name, and interaction frequency of the group;
the operation parameters of the labeling user at least comprise: an operation frequency and an operation period for preset types of target information;
the interaction characteristic parameters of the labeling user comprise at least one of the following parameters: the gender attribute of the other users, the interaction frequency between the other users and the labeling user, and the login address information of the other users.
11. A server, comprising:
a computer-readable storage medium to store executable instructions;
a processor for implementing the user classification method of any one of claims 1 to 5 when executing executable instructions stored in the computer-readable storage medium.
12. A computer-readable storage medium having stored thereon executable instructions for, when executed by a processor, implementing the user classification method of any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511033392.2A CN105701498B (en) | 2015-12-31 | 2015-12-31 | User classification method and server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105701498A (en) | 2016-06-22 |
CN105701498B (en) | 2021-09-07 |
Family
ID=56226820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511033392.2A Active CN105701498B (en) | 2015-12-31 | 2015-12-31 | User classification method and server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105701498B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106204060B (en) * | 2016-06-28 | 2018-04-13 | 腾讯科技(深圳)有限公司 | The method and device that user is divided to cluster realized by computer system |
CN106875183B (en) * | 2016-06-28 | 2020-07-28 | 阿里巴巴集团控股有限公司 | Method and device for determining bank account number, identity card number and state of information to be checked |
CN106709755A (en) * | 2016-11-28 | 2017-05-24 | 加和(北京)信息科技有限公司 | Method of predicting user frequency and apparatus thereof |
CN108268511A (en) * | 2016-12-30 | 2018-07-10 | 上海互联网软件集团有限公司 | Network user classification method based on big data |
CN108268495A (en) * | 2016-12-30 | 2018-07-10 | 上海互联网软件集团有限公司 | Network user's categorizing system based on big data |
CN108280104B (en) | 2017-02-13 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Method and device for extracting characteristic information of target object |
CN107240029B (en) * | 2017-05-11 | 2023-03-31 | 腾讯科技(深圳)有限公司 | Data processing method and device |
CN107330459B (en) * | 2017-06-28 | 2021-09-14 | 联想(北京)有限公司 | Data processing method and device and electronic equipment |
CN107563429B (en) * | 2017-07-27 | 2020-11-10 | 国家计算机网络与信息安全管理中心 | Method and device for classifying network user groups |
CN107392259B (en) * | 2017-08-16 | 2021-12-07 | 北京京东尚科信息技术有限公司 | Method and device for constructing unbalanced sample classification model |
CN109816134B (en) * | 2017-11-22 | 2021-07-20 | 北京京东尚科信息技术有限公司 | Method and device for predicting delivery address and storage medium |
CN108399418B (en) * | 2018-01-23 | 2021-09-03 | 北京奇艺世纪科技有限公司 | User classification method and device |
CN109063736B (en) * | 2018-06-29 | 2020-09-25 | 考拉征信服务有限公司 | Data classification method and device, electronic equipment and computer readable storage medium |
CN109492658A (en) * | 2018-09-21 | 2019-03-19 | 北京车和家信息技术有限公司 | A kind of point cloud classifications method and terminal |
CN109818782A (en) * | 2018-12-31 | 2019-05-28 | 南京红柑桔信息技术有限公司 | The method that a kind of pair of server is classified |
JP7168095B2 (en) * | 2019-08-29 | 2022-11-09 | 富士通株式会社 | PATTERN EXTRACTION PROGRAM, APPARATUS AND METHOD |
CN112468385B (en) * | 2019-09-09 | 2022-07-01 | 腾讯科技(深圳)有限公司 | Virtual grouping configuration method and device, storage medium and electronic device |
CN113934941B (en) * | 2021-10-12 | 2024-07-02 | 北京朗玛数联科技有限公司 | User recommendation system and method based on multidimensional information |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101266619A (en) * | 2008-05-12 | 2008-09-17 | 腾讯科技(深圳)有限公司 | User information excavation method and system |
CN102625940A (en) * | 2009-06-12 | 2012-08-01 | 电子湾有限公司 | Internet preference learning facility |
CN103778555A (en) * | 2014-01-21 | 2014-05-07 | 北京集奥聚合科技有限公司 | User attribute mining method and system based on user tags |
CN104298741A (en) * | 2014-10-09 | 2015-01-21 | 百度在线网络技术(北京)有限公司 | Method and device for providing push information |
CN104657369A (en) * | 2013-11-19 | 2015-05-27 | 深圳市腾讯计算机系统有限公司 | User attribute information generating method and system |
CN104718547A (en) * | 2013-10-11 | 2015-06-17 | 文化便利俱乐部株式会社 | Customer data analysis system |
CN104737565A (en) * | 2012-10-19 | 2015-06-24 | 脸谱公司 | Method relating to predicting the future state of a mobile device user |
CN104933075A (en) * | 2014-03-20 | 2015-09-23 | 百度在线网络技术(北京)有限公司 | User attribute predicting platform and method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10089639B2 (en) * | 2013-01-23 | 2018-10-02 | [24]7.ai, Inc. | Method and apparatus for building a user profile, for personalization using interaction data, and for generating, identifying, and capturing user data across interactions using unique user identification |
US20140358630A1 (en) * | 2013-05-31 | 2014-12-04 | Thomson Licensing | Apparatus and process for conducting social media analytics |
Non-Patent Citations (2)
Title |
---|
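A Hadoop-based parallel social network mining system; Li Guanchen; "Software"; 20140216; Vol. 34 (No. 12); 127-131 * |
Research on several typical data mining methods and their applications; Dong Cailing; "China Master's Theses Full-text Database, Information Science and Technology"; 20100915 (No. 09); I138-415 * |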
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||