CN107273454B

CN107273454B - User data classification method, device, server and computer readable storage medium

Info

Publication number: CN107273454B
Application number: CN201710401985.2A
Authority: CN
Inventors: 赫南; 朱顺; 孙振鹏; 杨旭; 陈英杰; 完灏; 胡景贺; 温园旭; 李慧倩; 李婵怡
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-05-31
Filing date: 2017-05-31
Publication date: 2020-11-03
Anticipated expiration: 2037-05-31
Also published as: CN107273454A

Abstract

The present disclosure provides a user data classification method, including: generating features of the user data; generating a labeled data set and an unlabeled data set of the user data according to a labeling rule; constructing, from the labeled data set and the unlabeled data set, a positive-sample labeled data set P and an unknown-sample data set U for one of a plurality of categories; generating a classifier according to the positive-sample labeled data set P, the unknown-sample data set U, and the features of the corresponding user data; and using the classifier to determine whether user data in the unlabeled data set belongs to the one category. The user data is classified through an improved positive and unlabeled (PU) learning algorithm. The method is suitable for extracting crowd features and mining crowds in similar life stages in the system, so that accurately crowd-targeted e-commerce advertisements can be provided.

Description

User data classification method, device, server and computer readable storage medium

Technical Field

The present disclosure relates to the field of internet technologies, and in particular, to a method and an apparatus for classifying user data, a server, and a computer-readable storage medium.

Background

In recent years, market researchers and sociologists have become increasingly aware that different categories of consumers, for example consumers in different life stages, exhibit different shopping behaviors. Consumers can be divided into coarse-grained life stages, such as students (young and single), newlyweds (young, without children), the middle-aged (married, with zero or more children), the elderly (older or retired, with children living independently), and so forth. People in different life stages (age groups) clearly show differentiated consumption trends. For example, pregnant women may purchase folic acid and vitamins, while mothers may purchase items such as milk powder, a baby carriage, a child safety seat, or puzzles, depending on the age of the baby. In the mother-and-infant channels of e-commerce websites and vertical apps, these purchase patterns are quite pronounced. Introducing the life stage of consumers into the precise crowd-targeting service and recommendation system for e-commerce advertisements can therefore yield a better recommendation effect.

However, in the process of implementing the present invention, the inventors found that the prior art has at least the following technical problems: the effectiveness of existing methods depends on the correctness and scale of the training data; moreover, some commodities, such as mother-and-infant commodities like milk powder, have standardized attributes that clearly indicate an age range, so these commodities are already strongly crowd-oriented, and existing methods are not necessarily suitable for recommendation applications. Therefore, there is a need for a method and an apparatus that classify users better, for example by more accurately and reliably mining consumers in the same life stage in an e-commerce system, thereby serving precise crowd targeting of e-commerce advertisements.

Disclosure of Invention

According to a first aspect of the present disclosure, there is provided a user data classification method, the method comprising: generating characteristics of the user data; generating a labeled data set and an unlabeled data set of the user data according to a labeling rule; constructing a positive sample labeled data set P and an unknown sample data set U of one of a plurality of categories according to the labeled data set and the unlabeled data set; generating a classifier according to the positive sample labeled data set P, the unknown sample data set U, and the characteristics of the corresponding user data; and determining, using the classifier, whether user data in the unlabeled data set belongs to the one category.

In one embodiment, the user data may be e-commerce user data, and the plurality of categories are a plurality of life stages, such as maternal-infant life stages.

In one embodiment, the method may further include determining whether the user data satisfies a labeling rule and, if so, adding the user data to the labeled data set, where the labeling rule may include: if the user data indicates that only goods of one life stage were purchased, determining the purchase time as the start time of that life stage; if the user data indicates that goods of multiple life stages were purchased in chronological order of the life stages, determining the start time from the life stage of the most recently purchased goods; and/or if the user data indicates that goods of multiple life stages were purchased but not in chronological order, determining the start time from the earliest life stage among the purchased goods. The method may further comprise determining which life stage the user data currently belongs to according to the determined start time of the life stage, the duration of each life stage, and the current time.

In one embodiment, the characteristics may include category characteristics of the purchased goods, demographic characteristics, and temporal characteristics, wherein the temporal characteristics may include purchase-time-weighted characteristics and characteristics related to individual life stages.

In one embodiment, the positive sample labeled dataset P may comprise user data in the labeled dataset belonging to the category, the unknown sample dataset U may comprise at least a portion of the set consisting of user data in the labeled dataset not belonging to the category and user data in the unlabeled dataset, and generating the classifier may comprise the steps of:

setting the classifier M to null and the reliable negative sample set RN to empty;

randomly sampling a portion S of the user data from P and adding it to U, the updated P and U being denoted Ps = P - S and Us = U + S;

training a logistic regression classifier LR_i (i = 0, 1, ...) using Ps as positive samples and Us as negative samples, and then performing the following steps:

(1) setting a classifier threshold th using S;

(2) for each sample u ∈ Us: if the result of classifier LR_i for u is less than the threshold th, adding u to RN; then updating Us = Us - RN;

(3) M = M + LR_i;

training a logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;

classifying P using LR_last, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, otherwise returning LR_last as the final classifier.

According to a second aspect of the present disclosure, there is provided a user data classification apparatus including: a feature generation unit 701, an annotation unit 702, a sample construction unit 703, a classifier generation unit 704, and a classification unit 705. The feature generation unit 701 is configured to generate features of the user data. The annotation unit 702 is configured to generate annotated and unlabeled datasets of user data according to annotation rules. The sample construction unit 703 is configured to construct a positive sample labeled dataset P and an unknown sample dataset U of one of a plurality of classes from the labeled dataset and the unlabeled dataset. The classifier generating unit 704 is configured to generate a classifier based on the positive sample labeling dataset P and the unknown sample dataset U and the characteristics of the corresponding user data. The classification unit 705 is configured to determine, using the classifier, whether user data in an unlabeled dataset belongs to the one class.

In one embodiment, the user data may be e-commerce user data and the plurality of categories may be a plurality of life stages, such as maternal-infant life stages.

In one embodiment, the annotation unit may be further configured to determine whether the user data satisfies an annotation rule and, if so, add the user data to the annotated data set, where the annotation rule includes: if the user data indicates that only goods of one life stage were purchased, determining the purchase time as the start time of that life stage; if the user data indicates that goods of multiple life stages were purchased in chronological order of the life stages, determining the start time from the life stage of the most recently purchased goods; and/or if the user data indicates that goods of multiple life stages were purchased but not in chronological order, determining the start time from the earliest life stage among the purchased goods. The annotation unit may be further configured to determine which life stage the user data currently belongs to according to the determined start time of the life stage, the duration of each life stage, and the current time.

In one embodiment, the characteristics may include category characteristics, demographic characteristics, and temporal characteristics of the purchased goods, wherein the temporal characteristics may further include purchase time weighting characteristics and characteristics related to individual life stages.

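In one embodiment, the positive sample labeled dataset P may comprise user data in the annotated dataset belonging to said category, the unknown sample dataset U may comprise at least a part of the set consisting of user data in the annotated dataset not belonging to said category and user data in the unlabeled dataset, and the classifier generation unit may be further configured to perform the following:

setting the classifier M to null and the reliable negative sample set RN to empty;

randomly sampling a portion S of the user data from P and adding it to U, the updated P and U being denoted Ps = P - S and Us = U + S;

training a logistic regression classifier LR_i (i = 0, 1, ...) using Ps as positive samples and Us as negative samples, and then performing the following steps:

(1) setting a classifier threshold th using S;

(2) for each sample u ∈ Us: if the result of classifier LR_i for u is less than the threshold th, adding u to RN; then updating Us = Us - RN;

(3) M = M + LR_i;

training a logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;

classifying P using LR_last, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, otherwise returning LR_last as the final classifier.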

According to a third aspect of the present disclosure, there is provided a server comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the first aspect.

According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of the first aspect.

The present disclosure provides an improved user data classification method that trains classifiers using labeled and unlabeled data sets, thereby achieving more accurate classification. More specifically, life stage targeting can be introduced into the precise crowd-targeting service for e-commerce advertisements, and the targeting can be extended to each life stage, including the maternal-infant life stages, so that a better personalized recommendation effect can be provided.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram illustrating a crowd mining basic flow 100 according to an embodiment of the present disclosure;

FIGS. 2A-2D are schematic diagrams illustrating life stages for annotating user data according to embodiments of the present disclosure;

FIG. 3 is a schematic diagram illustrating a tree structured e-commerce taxonomy in accordance with an embodiment of the present disclosure;

FIG. 4 is a flow chart illustrating a method of generating tags for a particular life stage according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram illustrating an ABTest tag evaluation design according to an embodiment of the present disclosure;

FIG. 6 is a flow chart illustrating a user data classification method according to an embodiment of the present disclosure;

FIG. 7 is a schematic block diagram illustrating a user data classification apparatus according to an embodiment of the present disclosure;

FIG. 8 is a schematic block diagram illustrating an exemplary system architecture 800 to which the user data classification method or user data classification apparatus of the present disclosure may be applied; and

FIG. 9 is a block diagram illustrating a computer system 900 for implementing embodiments of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The words "a", "an" and "the" and the like as used herein are also intended to include the plural forms unless the context clearly dictates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.

In the following, mining or classifying user data will be described by taking the mining of life stages (e.g., maternal-infant life stages) as an example, but those skilled in the art will recognize that the present disclosure may be extended to other classifications as well.

FIG. 1 illustrates a general diagram of a crowd mining basic flow 100 according to an embodiment of the disclosure.

As shown in FIG. 1, the crowd mining basic flow 100 according to this embodiment may include annotating the data at 120. For example, user data can be obtained from the data warehouse 110, the purchasing behavior of e-commerce users analyzed, and reasonable rules defined for automated annotation to produce annotated data sets and unlabeled data sets, as described in detail later.

Additionally, the crowd mining basic flow 100 may include building features at 130. For example, user data may be obtained from data warehouse 110, and features available for training extracted from the user's purchasing behavior.

The labeled and unlabeled data sets obtained in the data labeling operation 120 and the features obtained in the feature construction operation 130 can be fed into a classifier generation model 140 to generate a classifier. For example, the labeled and unlabeled data sets and the constructed training features can be used to perform positive and unlabeled (PU) sample learning, and through this learning process a classifier model can be generated to label a large amount of unlabeled data. Specifically, as shown in FIG. 1, a logistic regression (LR) classifier is generated iteratively using a PU learning algorithm. Although FIG. 1 shows two LR classifiers, those skilled in the art will appreciate that this is merely an example representing the iterative generation of classifiers.

Additionally, the resulting classifier may also be evaluated for effectiveness at 150. For example, the classification results of the classifier can be evaluated using a test set and an online A/B test.

The respective process steps described above are detailed below, taking the mining of the mother-and-infant population as an example. For example, the data covers consumer purchases on the Jingdong website from January to December 2015. For convenience of description, the mother-and-infant population may be divided into the following stages, with the label Lx denoting each labeled population, as shown in the following table.

TABLE 1 Maternal and infant population stages and tag values

Tag value	Life stage of mother and baby
L0	Pregnancy
L1	Baby 0-3 months old
L2	Baby 3-6 months old
L3	Baby 6-12 months old
L4	Baby 12-24 months old

Annotating data

First, statistical analysis can be performed on the user data to remove users whose ordering behavior within a period of time is abnormal. For example, based on the last year of data, users with a very high order frequency are identified as exhibiting order-brushing (fake order) behavior, and users with a very low order frequency are identified as having no significant consumption characteristics.

Secondly, for each remaining user, it can be determined when the user entered a certain maternal-infant stage, and then which maternal-infant stage the user is most likely in at present can be inferred from behavior characteristics (for example, the elapsed time or a confirmed role).

In the following, it will be described how it is determined when the user enters a certain maternal-infant phase.

Some products are only suitable for certain maternal-infant stages; for example, users at L0 are more likely to buy radiation-protective clothing or folic acid. By assigning mother-and-infant commodities to stages, the current maternal-infant stage of the purchasing crowd can be roughly judged. The characteristic commodities and commodity attributes corresponding to the different stages can be organized into a table similar to the following.

TABLE 2 characteristic commodities corresponding to different maternal and infant stages

In order to track the purchasing behavior sequence of the mother and infant commodities of the user all year round, a mother and infant stage behavior statistical table of orders can be established for each user, and the total amount of the orders purchased by each user in a certain mother and infant stage and the first and last purchasing time are recorded. The mother-infant states of some users can be preliminarily judged through the statistical data, and the statistical data can also be used as basic characteristics for subsequent model prediction.

FIG. 2A shows the behavior statistics for the maternal-infant stages, which record information about each user's purchases of commodities of each stage, such as the total order amount for commodities belonging to Lx (x = 0, 1, 2, 3, 4), the time of the first purchase of commodities belonging to Lx, and the time of the last purchase of commodities belonging to Lx.
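As a rough illustration of such a per-user statistics table, the following Python sketch aggregates a user's orders by stage. The record layout `(stage, amount, order_date)` and the function name `stage_statistics` are hypothetical and are not taken from the disclosure.

```python
from datetime import date


def stage_statistics(orders):
    """Aggregate one user's orders per maternal-infant stage.

    orders: iterable of (stage, amount, order_date) tuples (assumed layout).
    Returns a dict: stage -> total order amount, first and last purchase date.
    """
    stats = {}
    for stage, amount, day in orders:
        entry = stats.setdefault(
            stage, {"total_amount": 0.0, "first": day, "last": day}
        )
        entry["total_amount"] += amount
        entry["first"] = min(entry["first"], day)
        entry["last"] = max(entry["last"], day)
    return stats


example_orders = [
    ("L3", 120.0, date(2015, 6, 1)),
    ("L4", 80.0, date(2015, 12, 21)),
    ("L4", 40.0, date(2015, 12, 25)),
]
example_stats = stage_statistics(example_orders)
```

The resulting dictionary holds one row per stage Lx, which mirrors the three quantities the statistics table is described as recording.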

After generating statistics of the behaviors of the various maternal and infant stages of the user, the maternal and infant stage of the user in the purchasing behavior and the time for starting to enter the stage can be determined according to the following labeling rules:

Rule one: the user's orders do not span multiple maternal-infant stages. In this case, the user has only placed orders for commodities belonging to one maternal-infant stage. For example, as shown in FIG. 2B, the user purchased commodities belonging to stage L4 (one or more times) and did not purchase commodities of stages L0-L3. In this case, the user is determined to have entered stage L4 at the time of the earliest such order. Since the user has only placed orders belonging to stage L4, (L4, 2015-11-23) is output for the user, indicating that the user entered stage L4 on November 23, 2015.

Rule two: the user's orders span multiple maternal-infant stages. This is further subdivided into two cases:

(a) The ordering times of the multiple stages do not cross. That is, the user purchases the products of the maternal-infant stages in chronological order of the stages. In this case, the life stage of the most recently purchased commodity is used, and the earliest order time within that life stage is taken as the start time. For example, as shown in FIG. 2C, the item that the user ordered last belongs to stage L4, so (L4, 2015-12-21) is output for the user, indicating that the user entered the L4 maternal-infant stage on December 21, 2015.

(b) The ordering times of the multiple stages cross. That is, the user purchases goods of multiple life stages, and the corresponding life stages do not progress in chronological order. For example, if the user (who may actually be in the pregnancy stage) buys a baby carriage (say, for use at 3-6 months) and then buys a nipple (say, for use at 0-3 months), the user is approximately determined to be in stage L1 (0-3 months) based on the earliest life stage among the user's purchased goods, and the earliest order time within stage L1 is taken as the start time. For example, as shown in FIG. 2D, the item that the user ordered last belongs to stage L1, so (L1, 2015-09-10) is output for the user, indicating that the user entered stage L1 on September 10, 2015.
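The three cases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: stages are encoded as integers 0 to 4 for L0 to L4, each order is a `(stage, order_date)` pair, and the function name `label_stage_entry` is hypothetical.

```python
from datetime import date


def label_stage_entry(orders):
    """Return (stage, start_date) for one user, or None if there are no orders.

    orders: list of (stage, order_date), with stage an int 0..4 for L0..L4
    (an assumed encoding).
    """
    if not orders:
        return None
    by_time = sorted(orders, key=lambda o: o[1])
    stages = [stage for stage, _ in by_time]
    if len(set(stages)) == 1:
        # Rule one: only one stage was ever purchased.
        chosen = stages[0]
    elif stages == sorted(stages):
        # Rule two (a): stages progress in time order, so use the stage
        # of the most recent purchase.
        chosen = stages[-1]
    else:
        # Rule two (b): stages cross in time, so use the earliest life stage.
        chosen = min(stages)
    # In every case the start time is the earliest order in the chosen stage.
    start = min(day for stage, day in by_time if stage == chosen)
    return chosen, start


rule_one = label_stage_entry([(4, date(2015, 11, 23)), (4, date(2015, 12, 1))])
rule_two_a = label_stage_entry(
    [(3, date(2015, 6, 1)), (4, date(2015, 12, 21)), (4, date(2015, 12, 25))]
)
rule_two_b = label_stage_entry([(2, date(2015, 8, 1)), (1, date(2015, 9, 10))])
```

The example inputs are made up to reproduce the outputs quoted in the text; the actual contents of FIGS. 2B-2D are not known here.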

After determining when the user entered a certain maternal-infant stage, the predefined duration of each stage can be used to estimate which maternal-infant stage the user should be in at the current time (for example, December 31, 2015). This portion of the data will be used to build the positive samples for classifier model training (i.e., user data labeled as belonging to a particular maternal-infant stage), as will be described in detail later.
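One way to carry out this estimate is to walk forward through the stage durations from the entry time, as in the sketch below. The day counts in `STAGE_DURATION_DAYS` are assumptions chosen for illustration (roughly 280 days for pregnancy and month ranges converted to days), and the sketch also assumes the user entered the stage at its very beginning; neither is specified in the text.

```python
from datetime import date

# Assumed durations in days for L0..L4; not values given by the disclosure.
STAGE_DURATION_DAYS = [280, 90, 90, 180, 365]


def current_stage(entry_stage, entry_date, today):
    """Estimate the stage index at `today`, or None once the user is past L4."""
    elapsed = (today - entry_date).days
    if elapsed < 0:
        raise ValueError("entry_date is after today")
    stage = entry_stage
    while stage < len(STAGE_DURATION_DAYS):
        if elapsed < STAGE_DURATION_DAYS[stage]:
            return stage
        elapsed -= STAGE_DURATION_DAYS[stage]
        stage += 1
    return None


today = date(2015, 12, 31)
still_l4 = current_stage(4, date(2015, 11, 23), today)
moved_on = current_stage(1, date(2015, 9, 10), today)
```

With these assumed durations, a user who entered L1 on 2015-09-10 has spent 112 days since entry and is therefore estimated to be in L2 by 2015-12-31.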

It should be noted that although it is described that the above annotation rules can be used to deduce which maternal and infant stage the user should be in, the above rules may only cover a part of the user data and generate annotation data. That is, the rule may not cover all user data, and at this time, the user data that the rule does not cover will form unlabeled data, and will need to be classified by the classifier in the future.

Thus, through the data annotation operation 120, annotated and unlabeled data sets may be generated from the user data of the data warehouse 110.

Building features

Features need to be constructed as input to the classifier generation model 140 before training the model, and the features used may include the following groups: category characteristics, user demographic characteristics, and temporal characteristics, described separately below.

Category characteristics

Generally, e-commerce platforms display goods in hierarchical categories. For example, Jingdong uses a three-level category hierarchy to display commodities with different attributes, so that users can quickly locate the commodities they need. FIG. 3 illustrates such a tree-structured e-commerce taxonomy in accordance with an embodiment of the present disclosure. The commodities in the Jingdong mall include: first-level categories such as household appliances, ..., and books, audio-video products, and e-books; second-level categories under household appliances such as large household appliances, ..., personal health care, and hardware and home improvement; and third-level categories under large household appliances such as flat-panel televisions, air conditioners, and washing machines.

A user's purchase of goods reflects the user's needs at the time or for some time in the future. For example, mothers at the beginning of pregnancy tend to buy maternity wear, radiation-protective clothing, and the like, while at a later stage they may buy diapers, milk powder, cribs, and the like to prepare for the arrival of their children. Which brand of diaper is bought (for example, one specific brand or another), however, is not of concern. Therefore, describing the user's purchasing behavior at the level of third-level categories captures the user's needs at a fine granularity while still grouping commodities of the same type into one class.

To reduce the influence of popular categories, each user can be regarded as a document and each category as a word appearing in that document, and the TF-IDF (term frequency-inverse document frequency) value of each category for the user is calculated to construct a category feature vector.
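A minimal sketch of this user-as-document TF-IDF computation is given below. It assumes the common variant tf = count / total purchases and idf = log(N / df); the exact weighting used in the disclosure is not stated, and the names `category_tfidf` and `purchases` are hypothetical.

```python
import math
from collections import Counter


def category_tfidf(user_purchases):
    """Compute a sparse TF-IDF vector per user.

    user_purchases: dict user_id -> list of third-level category names,
    one entry per purchase (assumed input layout).
    """
    n_users = len(user_purchases)
    # Document frequency: number of users who bought each category.
    df = Counter()
    for categories in user_purchases.values():
        df.update(set(categories))
    vectors = {}
    for user, categories in user_purchases.items():
        counts = Counter(categories)
        total = sum(counts.values())
        vectors[user] = {
            cat: (n / total) * math.log(n_users / df[cat])
            for cat, n in counts.items()
        }
    return vectors


purchases = {
    "u1": ["diapers", "milk powder", "diapers"],
    "u2": ["diapers"],
    "u3": ["maternity wear", "diapers"],
}
tfidf = category_tfidf(purchases)
```

In this toy input every user bought diapers, so that category receives weight 0 for everyone, which is the damping of popular categories that the text describes.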

User demographic characteristics

Typically, the consumption behavior of a user is related to the user's demographic characteristics. For example, differences in age, sex, e-commerce membership grade (which often represents spending power), and the like are reflected in differences in users' consumption habits. The present disclosure uses the registered user information of an e-commerce website and the shopping behaviors of users to extract features along multiple user dimensions, referred to as a "user profile". Examples of user demographic characteristics are shown in the following table.

TABLE 3 user demographic Properties

Temporal characteristics

The temporal features may include, for example, temporal features related to individual life stages (e.g., maternal-infant stages) and time-weighted features.

Temporal features related to individual life stages. For example, a purchase of maternity wear a year ago and one a month ago carry very different weight when inferring which maternal-infant stage the user is currently in; in the latter case the user is more likely to belong to stage L0. Likewise, if the user has purchased commodities belonging to a certain maternal-infant stage (L0) multiple times, how long the user has been in that stage can be roughly estimated, and a user who has been pregnant for 9 months is more likely to purchase commodities of the next stage (L1) than a user who has been pregnant for 2 months. To this end, the present disclosure presents exemplary mother-and-infant commodity purchase features, as shown in the following table.

TABLE 4 time characteristics of the user's purchase of various maternal and infant stage commodities

Time-weighted features. Similarly, a user who purchased a commodity a year ago and a user who purchased it a month ago may differ greatly in their current activity level, and the latter may be more likely to purchase the commodity again within a short period. A time-weighted feature formula is defined as follows:

where λ is a decay factor, which can be taken as 5.0/365 in the present disclosure, T is the timestamp of December 31, 2015, t_i is the date timestamp of the user's i-th order, and m is the total number of orders of the user.
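Because the formula itself does not appear in the text above, the sketch below assumes one common form that is consistent with the listed symbols: an exponentially decayed sum over the m orders, sum over i of exp(-λ (T - t_i)), with T - t_i measured in days. This form is an assumption for illustration and may differ from the formula in the disclosure; the function name is hypothetical.

```python
import math
from datetime import date

DECAY = 5.0 / 365  # lambda from the text, assumed here to be per day


def time_weighted_feature(order_dates, reference_date, decay=DECAY):
    """Assumed form: sum of exp(-decay * age_in_days) over all orders."""
    return sum(
        math.exp(-decay * (reference_date - day).days) for day in order_dates
    )


T = date(2015, 12, 31)
recent = time_weighted_feature([date(2015, 11, 30)], T)
old = time_weighted_feature([date(2014, 12, 31)], T)
```

Under this assumed form an order placed on the reference date contributes 1.0 and an order placed 365 days earlier contributes exp(-5), so recent purchases dominate the feature.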

Finally, during training, the various features are normalized and arranged into a multi-dimensional feature matrix, in which the feature vector of each user corresponds to one row of the matrix.
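As an illustration of this arrangement, the following sketch builds a matrix with one row per user and applies min-max scaling per column. The normalization method is not specified in the text, so min-max scaling is an assumption, as are the function name and the dict-based input layout.

```python
def build_feature_matrix(user_features, feature_names):
    """Return (user_ids, matrix) with one min-max normalized row per user.

    user_features: dict user_id -> dict feature_name -> numeric value
    (assumed layout); missing features default to 0.
    """
    users = sorted(user_features)
    raw = [
        [float(user_features[u].get(name, 0.0)) for name in feature_names]
        for u in users
    ]
    lows = [min(col) for col in zip(*raw)]
    highs = [max(col) for col in zip(*raw)]
    matrix = [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in raw
    ]
    return users, matrix


features = {
    "u1": {"age": 20, "orders": 1},
    "u2": {"age": 40, "orders": 1},
    "u3": {"age": 30},
}
user_ids, X = build_feature_matrix(features, ["age", "orders"])
```

Each row of `X` is the feature vector of the user at the same position in `user_ids`, with every column scaled to the range 0 to 1.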

Classifier model generation

In the present disclosure, classifier model generation may include positive and unlabeled sample learning. The present disclosure employs a semi-supervised learning method (PU learning) to achieve population expansion. As described above, the labeling rules yield only small-scale positive-sample labeled data and cannot label a reliable negative sample set, so a reliable classification model cannot be trained directly. The present disclosure addresses this classification problem, in which there are only a small number of positive examples and a large number of unknown samples, by applying PU learning. Specifically, the present disclosure proposes an algorithm, which may be referred to as the "spy technique", that adds a portion of the positive examples, sampled at a given rate from the total positive example data, to the unlabeled samples and trains a model in order to obtain reliable negative examples.

The basic idea of the algorithm is as follows:

Because no reliable negative samples exist, the initial reliable negative sample set RN is empty. A portion S of the data is randomly extracted from the positive samples P and added to the unknown samples U, yielding Ps and Us. Ps is labeled 1 and Us is labeled 0, and an initial logistic regression classifier is trained. A threshold is set using the data set S, the whole of U is classified, and the data W that the classifier labels 0 is added to RN. A classifier is then trained using Ps and RN, the samples it classifies as negative are added to RN, and the iteration repeats until the termination condition is met. In short, while preserving the classification accuracy on positive examples, the PU learning algorithm expands the reliable negative sample set in each iteration.

Generally, PU learning methods are best suited to binary classification problems, whereas the division of maternal-infant life stages is a multi-class problem; the one-vs-rest strategy is therefore adopted to convert the multi-class problem into multiple binary classification problems.

The complete spy-technique-based PU learning algorithm flow can be summarized as follows:

Algorithm: PU learning algorithm flow

In the above algorithm, some parameters of the PU learning algorithm need to be set, such as the sampling rate s% and the threshold th. So that the positive training samples are not too few while S is still large enough to achieve the "spy" effect, the present disclosure may use a sampling rate of, for example, 15%. Ideally, the threshold th set for the model generated in each iteration would allow the whole data set S to be correctly classified as positive. Because of noise in the data, however, it is sufficient to set th so that the model's accuracy on S is, for example, between 80% and 100%; for example, th may be set so that the accuracy of classifying S is 95%.
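The spy-based flow and its parameters can be sketched in plain Python as follows. This is an illustrative reading of the steps, not the disclosed implementation: the small gradient-descent trainer `train_lr` stands in for any logistic regression library, and the termination condition (no new reliable negatives, no remaining unknown samples, or `max_iter` reached), the 0.5 decision threshold in the final check, the `max_miss` tolerance, and the choice of fallback classifier are all assumptions.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_lr(pos, neg, epochs=300, step=0.3):
    """Batch gradient-descent logistic regression; returns a scoring function."""
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in data:
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            gw = [g + err * xj for g, xj in zip(gw, x)]
            gb += err
        w = [wj - step * g / len(data) for wj, g in zip(w, gw)]
        b -= step * gb / len(data)
    return lambda x: _sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def spy_threshold(model, spies, keep=0.95):
    """Pick th so that roughly a fraction `keep` of the spies score >= th."""
    scores = sorted(model(x) for x in spies)
    return scores[int(len(scores) * (1.0 - keep))]


def pu_learn(P, U, sample_rate=0.15, keep=0.95, max_iter=5, max_miss=0.1, seed=0):
    """Return (final scoring function, reliable negative set RN)."""
    rng = random.Random(seed)
    shuffled = rng.sample(list(P), len(P))
    n_spy = max(1, int(len(shuffled) * sample_rate))
    S, Ps = shuffled[:n_spy], shuffled[n_spy:]      # Ps = P - S
    Us, RN, M = list(U) + S, [], []                 # Us = U + S
    model = train_lr(Ps, Us)                        # LR_0: Ps vs. Us
    for i in range(max_iter):
        th = spy_threshold(model, S, keep)          # step (1)
        scored = [(u, model(u)) for u in Us]
        new_rn = [u for u, s in scored if s < th]   # step (2)
        Us = [u for u, s in scored if s >= th]
        RN += new_rn
        M.append(model)                             # step (3)
        if not new_rn or not Us or i == max_iter - 1:
            break                                   # assumed termination condition
        model = train_lr(Ps, RN)                    # next LR_i: Ps vs. RN
    last = M[-1]
    # Final check: fall back to an earlier classifier if LR_last rejects
    # too many known positives (0.5 cut-off and max_miss are assumptions).
    missed = sum(1 for x in P if last(x) < 0.5) / len(P)
    fallback = M[1] if len(M) > 1 else M[0]
    return (fallback if missed > max_miss else last), RN
```

The defaults `sample_rate=0.15` and `keep=0.95` mirror the example values of 15% and 95% mentioned above. In practice the hand-written trainer would likely be replaced by a library logistic regression.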

Fig. 4 illustrates a flow chart of a method 400 of generating a pregnancy (L0) stage tag according to an embodiment of the present disclosure. Those skilled in the art will appreciate that the same methods are applicable to the generation of stage L1 through L4 tags.

The method 400 starts at step 401. Then, in step 402, it is determined whether the user data satisfies the automated labeling rules. If so, the positive examples of the user data, that is, the labeled data set, are obtained in step 403. At this point, the labeled data set may include user data of each of the stages L0 through L4. If not, the corresponding user data constitutes unlabeled data in step 404, i.e., user data that the automated labeling rules do not cover.

Next, in step 405, it is determined whether the rule-labeled user data obtained in step 403 belongs to stage L0, because binary classification will be performed using the one-vs-rest method. If so, the corresponding user data is labeled 1, and the training data set P, validation data set, and test set may be randomly generated in an 8:1:1 ratio. Those skilled in the art will appreciate that the training data set, the validation data set, and the test set may be split in other proportions to produce a more accurate and reliable classifier; the ratio is not limited to the above. The unlabeled data generated in step 404 and the user data determined in step 405 not to belong to stage L0 may be combined and sampled proportionally to produce an unlabeled data set U. For example, based on the training data set P, data may be randomly sampled from the combined data at a ratio of 1:10 with respect to P to produce the unlabeled data set U. That is, the ratio of positive to unlabeled user data is 1:10.
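The one-vs-rest construction of P and U for a single stage might look like the sketch below. The function name, the dict-based input, and the 80%/10%/10% split of the positives are illustrative assumptions; only the 1:10 ratio of P to U is taken directly from the text.

```python
import random


def build_pu_sets(labeled, unlabeled, target_stage, u_ratio=10, seed=0):
    """Build (P, validation, test, U) for one target stage.

    labeled: dict user_id -> stage label from the labeling rules.
    unlabeled: iterable of user ids that the rules did not cover.
    """
    rng = random.Random(seed)
    positives = sorted(u for u, s in labeled.items() if s == target_stage)
    rng.shuffle(positives)
    n_train = int(len(positives) * 0.8)   # assumed 80% for training
    n_val = int(len(positives) * 0.1)     # assumed 10% for validation
    P = positives[:n_train]
    validation = positives[n_train:n_train + n_val]
    test = positives[n_train + n_val:]
    # Labeled users of other stages are pooled with the unlabeled users.
    pool = sorted(u for u, s in labeled.items() if s != target_stage)
    pool += sorted(unlabeled)
    U = rng.sample(pool, min(len(pool), u_ratio * len(P)))
    return P, validation, test, U


labeled_users = {f"a{i}": "L0" for i in range(10)}
labeled_users.update({f"b{i}": "L3" for i in range(50)})
unlabeled_users = [f"c{i}" for i in range(100)]
P, val, test, U = build_pu_sets(labeled_users, unlabeled_users, "L0")
```

Repeating the call with each of L0 to L4 as `target_stage` yields the five binary problems implied by the one-vs-rest strategy.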

Furthermore, in addition to the training data set P and the unlabeled data set U, features of the user data are required as input to the positive and unlabeled learning algorithm. Thus, at step 408, features of the user data may be extracted. Then, in step 409, a classifier is generated by the positive and unlabeled learning algorithm; for the specific process, refer to the flow described above.

Then, at step 410, the unlabeled data is classified using the generated classifier, and if the output at step 411 is 1, the corresponding user data is labeled with L0 at step 412. At step 413, the method 400 ends.

By repeating this method for each stage, all user data can be classified into the various life stages.

Evaluation of Effect

In addition to offline cross-validation of the classification model, the present disclosure also designs an online ABTest validation mechanism, which validates label quality and service metrics online and verifies the reliability of the mining results by predicting the e-commerce consumer population.

Fig. 5 shows the ABTest label evaluation design. From the traffic side, according to whether an exposure hits a labeled user, the traffic is divided into three sets:

Set A: the exposures participating in the experiment;

Set B: the exposures in A for which the requesting user carries the label L to be verified;

Set C: the exposures in B that are triggered by the targeting label L.

Taking verification of the maternal-infant population label as an example, the following comparative experiments were designed to measure the value of the L0 targeting label:

exp-base: pv (page view) sampling, baseline experiment, in which the L0 label data is used correctly;

exp-random: pv sampling, random experiment, in which the L0 targeting label data is randomized (implemented by randomly selecting users u1 and u2 and exchanging their L0 targeting label data, with everything else unchanged);

exp-use: pv sampling, in which the targeting label L0 is not used; the L0 label data on the user is manually removed.

Using the ABTest system, exp-base is compared with exp-random and with exp-use on set B, and exp-base is compared with exp-random on set C, while observing advertising service metrics such as CPM (cost per mille, i.e., cost per thousand impressions) and CTR (click-through rate).
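The comparison above can be illustrated with a small sketch. The metric definitions follow the standard meaning of CTR and CPM; the function names, the pairwise label swap used for exp-random, and the relative-lift helper are hypothetical illustrations, not part of the disclosure.

```python
import random

def ctr(clicks, impressions):
    """Click-through rate: clicks per impression."""
    return clicks / impressions if impressions else 0.0

def cpm(cost, impressions):
    """Cost per mille: cost per thousand impressions."""
    return 1000.0 * cost / impressions if impressions else 0.0

def swap_labels_randomly(user_labels, n_swaps, seed=0):
    """exp-random sketch: exchange the L0 label values of randomly chosen user pairs."""
    rng = random.Random(seed)
    swapped = dict(user_labels)
    users = sorted(swapped)
    for _ in range(n_swaps):
        u1, u2 = rng.sample(users, 2)
        swapped[u1], swapped[u2] = swapped[u2], swapped[u1]
    return swapped

def relative_lift(base, other):
    """Relative change of the exp-base metric over a comparison experiment."""
    return (base - other) / other
```

Swapping keeps the overall share of L0-labeled users unchanged, so a metric gap between exp-base and exp-random reflects who carries the label rather than how many users carry it.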

Fig. 6 illustrates a user data classification method 600 according to an embodiment of the disclosure, the method 600 including: in step 601, generating characteristics of user data; in step 602, generating a labeled data set and an unlabeled data set of the user data according to the labeling rule; in step 603, constructing a positive sample labeled dataset P and an unknown sample dataset U of one of a plurality (e.g., more than 2) of classes from the labeled dataset and the unlabeled dataset; in step 604, a classifier is generated according to the positive sample label data set P, the unknown sample data set U and the characteristics of the corresponding user data; and in step 605, determining whether the user data in the unlabeled dataset belongs to the one class using the classifier.

In one embodiment, the user data may be e-commerce user data, and the plurality of categories are a plurality of life stages.

In one embodiment, the method 600 may further include determining whether the user data satisfies a labeling rule and, if so, adding the user data to the labeled data set. The labeling rule may include: if the user data indicates that goods of only one life stage were purchased, determining the purchase time as the start time of that life stage; if the user data indicates that goods of multiple life stages were purchased in chronological order, determining the start time of the corresponding life stage from the time of the last purchase; and/or if the user data indicates that goods of multiple life stages were purchased but not in chronological order, determining the earliest order time belonging to a life stage as the start time of that life stage. The method may further comprise determining to which life stage the user data currently belongs according to the determined start time of the life stage, the duration of each life stage, and the current time.
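The last step of this embodiment, deriving the current life stage from a determined start time, the stage durations, and the current time, can be sketched as follows. The durations in `STAGE_DURATIONS` are hypothetical placeholders chosen only for illustration, and the function name `current_stage` is likewise an assumption; the sketch does not cover how the start time itself is inferred from purchases.

```python
from datetime import datetime

# Hypothetical stage durations in days; real values depend on how L0..L4 are defined.
STAGE_DURATIONS = [("L0", 280), ("L1", 365), ("L2", 730), ("L3", 1095), ("L4", 1095)]

def current_stage(start_stage, start_time, now, durations=STAGE_DURATIONS):
    """Walk forward from the labeled stage until the elapsed time is used up."""
    names = [name for name, _ in durations]
    index = names.index(start_stage)
    elapsed = (now - start_time).days
    if elapsed < 0:
        raise ValueError("start_time lies after the current time")
    for name, days in durations[index:]:
        if elapsed < days:
            return name
        elapsed -= days
    return None  # the user has moved past the last modelled stage
```

For instance, a user whose L0 stage was determined to start on 2017-01-01 would still be in L0 two months later and would have moved on to L1 by December of the same year under these placeholder durations.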

(1) Setting a classifier threshold th by using S;

(3) M = M + LR_i;

Fig. 7 shows a user data classification apparatus 700 according to an embodiment of the present disclosure. The user data classification apparatus 700 includes: a feature generation unit 701, an annotation unit 702, a sample construction unit 703, a classifier generation unit 704, and a classification unit 705. The feature generation unit 701 is configured to generate features of the user data. The annotation unit 702 is configured to generate annotated and unlabeled datasets of user data according to annotation rules. The sample construction unit 703 is configured to construct a positive sample labeled dataset P and an unknown sample dataset U of one of a plurality of classes from the labeled dataset and the unlabeled dataset. The classifier generating unit 704 is configured to generate a classifier based on the positive sample labeling dataset P and the unknown sample dataset U and the characteristics of the corresponding user data. The classification unit 705 is configured to determine, using the classifier, whether user data in an unlabeled dataset belongs to the one class.

In one embodiment, the user data may be e-commerce user data and the plurality of categories may be a plurality of life stages.

(1) Setting a classifier threshold th by using S;

(3) M = M + LR_i;

Fig. 8 shows an exemplary system architecture 800 to which the user data classification method or user data classification apparatus of the present disclosure may be applied.

As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves to provide a medium for communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various types of connections, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 801, 802, 803 to interact with the server 805 over the network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 805 may be a server that provides various services, such as a back-end management server (for example only) that supports shopping websites browsed by users using the terminal devices 801, 802, 803. The back-end management server may analyze and otherwise process received data, such as a product information query request, and feed back a processing result (for example, target push information or product information, by way of example only) to the terminal device.

It should be noted that the user data classification method provided in the embodiment of the present application may be generally executed by the server 805, and accordingly, the user data classification apparatus may be generally disposed in the server 805.

It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use in implementing embodiments of the present disclosure. The computer system illustrated in FIG. 9 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.

As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage section 908 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present disclosure are executed when the computer program is executed by a Central Processing Unit (CPU) 901.

It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims

1. A user data classification method, comprising:

generating characteristics of the user data, the characteristics including category characteristics, demographic characteristics, and temporal characteristics of the purchased goods, wherein the temporal characteristics include purchase time weighted characteristics and characteristics related to individual life stages, and wherein the purchase time weighted characteristics are defined as follows:

where λ is the attenuation factor, T is the time stamp, T_i is the timestamp of the i-th purchasing behavior of the user, and m is the total number of purchases by the user up to T;

generating a labeled data set and an unlabeled data set of the user data according to the labeling rule;

constructing a positive sample labeling data set P and an unknown sample data set U of one of a plurality of categories according to the labeling data set and the unlabeled data set;

generating a classifier according to the positive sample labeling data set P, the unknown sample data set U and the characteristics of corresponding user data;

determining, using the classifier, whether user data in an unlabeled dataset belongs to the class.

2. The method of claim 1, wherein the user data is e-commerce user data and the plurality of categories are a plurality of life stages.

3. The method of claim 2, further comprising determining whether the user data satisfies a labeling rule, and if so, adding to a labeling dataset, the labeling rule comprising:

if the user data indicates that only goods of one life stage have been purchased, the purchase time is determined as the start time of the life stage,

if the user data indicates that a plurality of life stage commodities are purchased and purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage, and/or

if the user data indicates that commodities of a plurality of life stages are purchased and the commodities are not purchased in chronological order, determining the earliest order time belonging to a life stage as the start time of that life stage;

and determining which life stage the user data currently belongs to according to the determined start time of the life stages, the duration of each life stage and the current time.

4. The method according to claim 1, wherein the positive sample labeling data set P comprises user data in the labeled data set belonging to said category, the unknown sample data set U comprises at least a part of the set consisting of user data in the labeled data set not belonging to said category and user data in the unlabeled data set, and generating the classifier comprises the steps of:

training logistic regression classifiers LR_i, i = 0, 1, …, using Ps as positive samples and Us as negative samples, as follows:

(1) Setting a classifier threshold th by using S;

(3) M = M + LR_i;

5. A user data classification apparatus comprising:

a feature generation unit configured to generate features of the user data, the features including category features, demographic attribute features, and temporal features of the purchased goods, wherein the temporal features include purchase time weighted features and features related to respective life stages, and wherein the purchase time weighted features are defined as follows:

an annotation unit configured to generate a labeled data set and an unlabeled data set of the user data according to the labeling rule;

a sample construction unit configured to construct a positive sample labeling data set P and an unknown sample data set U of one of a plurality of categories according to the labeled data set and the unlabeled data set;

a classifier generation unit configured to generate a classifier according to the positive sample labeling data set P, the unknown sample data set U and the features of the corresponding user data;

a classification unit configured to determine whether user data in an unlabeled dataset belongs to the class using the classifier.

6. The apparatus of claim 5, wherein the user data is e-commerce user data and the plurality of categories are a plurality of life stages.

7. The apparatus of claim 6, wherein the annotation unit is further configured to determine whether the user data satisfies an annotation rule, and if so, to add to an annotated data set, the annotation rule comprising:

wherein the annotation unit is further configured to determine to which life stage the user data currently belongs, depending on the determined start time of the life stage, the duration of each life stage and the current time.

8. The apparatus according to claim 5, wherein the positive sample labeling data set P comprises user data in the labeled data set belonging to said category, the unknown sample data set U comprises at least a part of the set consisting of user data in the labeled data set not belonging to said category and user data in the unlabeled data set, and the classifier generation unit is further configured to:

(1) Setting a classifier threshold th by using S;

(3) M = M + LR_i;

use LR_last to classify P and, if more than a certain threshold number of positive samples are determined to be negative, return LR_1 as the final classifier, otherwise return LR_last as the final classifier.

9. A server, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.

10. A computer readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1 to 4.