CN107273454B - User data classification method, device, server and computer readable storage medium - Google Patents

Info

Publication number
CN107273454B
Authority
CN
China
Prior art keywords
user data
classifier
data set
user
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710401985.2A
Other languages
Chinese (zh)
Other versions
CN107273454A (en)
Inventor
赫南
朱顺
孙振鹏
杨旭
陈英杰
完灏
胡景贺
温园旭
李慧倩
李婵怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201710401985.2A priority Critical patent/CN107273454B/en
Publication of CN107273454A publication Critical patent/CN107273454A/en
Application granted granted Critical
Publication of CN107273454B publication Critical patent/CN107273454B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data

Abstract

The present disclosure provides a user data classification method, including: generating features of the user data; generating a labeled data set and an unlabeled data set of the user data according to a labeling rule; constructing a positive sample labeled data set P and an unknown sample data set U for one of a plurality of categories according to the labeled data set and the unlabeled data set; generating a classifier according to the positive sample labeled data set P, the unknown sample data set U and the features of the corresponding user data; and using the classifier to determine whether the user data in the unlabeled data set belongs to the one category. The user data are classified by an improved positive and unlabeled (PU) learning algorithm. The method is suitable for extracting features of user groups and mines groups in similar life stages in the system, so that accurately targeted e-commerce advertisements can be provided.

Description

User data classification method, device, server and computer readable storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method and an apparatus for classifying user data, a server, and a computer-readable storage medium.
Background
Market researchers and sociologists have increasingly recognized in recent years that different categories of consumers, for example consumers in different life stages, exhibit different shopping behaviors. Consumers can be divided into coarse-grained life stages, such as the student stage (young and single), newlyweds (young, without children), middle-aged (married, with 0 or more children), elderly (older or retired, with children living independently), and so forth. People in different life stages (age groups) show clearly differentiated consumption trends. For example, pregnant women may purchase folic acid and vitamins, while mothers may purchase items such as milk powder, a baby carriage, a safety seat, or puzzles, depending on the age of the baby. In the mother and infant channels of e-commerce websites and vertical apps, these purchase patterns are quite pronounced. Introducing consumers' life stages into the precise audience targeting service and recommendation system for e-commerce advertising can therefore yield better recommendation results.
However, in the process of implementing the present invention, the inventors found that the prior art has at least the following technical problems: the effectiveness of existing methods depends on the correctness and scale of the training data; meanwhile, some goods, such as mother and infant goods like milk powder, have standardized attributes that clearly indicate an age range, so these goods are strongly targeted at particular groups and the existing methods are not necessarily suitable for recommendation applications. Therefore, there is a need for a method and an apparatus that classify users better, for example by more accurately and reliably mining consumers in the same life stage in an e-commerce system, thereby serving precise audience targeting of e-commerce advertisements.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a user data classification method, the method comprising: generating features of the user data; generating a labeled data set and an unlabeled data set of the user data according to a labeling rule; constructing a positive sample labeled data set P and an unknown sample data set U for one of a plurality of categories according to the labeled data set and the unlabeled data set; generating a classifier according to the positive sample labeled data set P, the unknown sample data set U and the features of the corresponding user data; and determining, using the classifier, whether user data in the unlabeled data set belongs to the one category.
In one embodiment, the user data may be e-commerce user data, and the plurality of categories are a plurality of life stages, such as maternal-infant life stages.
In one embodiment, the method may further include determining whether the user data satisfies a labeling rule and, if so, adding the user data to the labeled data set, where the labeling rule may include: if the user data indicates that only goods of one life stage were purchased, determining the purchase time as the start time of that life stage; if the user data indicates that goods of multiple life stages were purchased in chronological order, determining the start time of the corresponding life stage from the most recently purchased goods; and/or if the user data indicates that goods of multiple life stages were purchased but not in chronological order, determining the start time from the earliest life stage. The method may further comprise determining which life stage the user data currently belongs to, according to the determined start time of the life stage, the duration of each life stage and the current time.
In one embodiment, the features may include category features of the purchased goods, demographic features, and temporal features, where the temporal features may include purchase-time-weighted features and features related to individual life stages.
In one embodiment, the positive sample labeled data set P may comprise user data in the labeled data set belonging to the category, the unknown sample data set U may comprise at least a portion of the set consisting of user data in the labeled data set not belonging to the category and user data in the unlabeled data set, and generating the classifier may comprise the steps of:
setting the classifier M to be null and the reliable negative sample set RN to be null;
randomly sampling a part S of the user data from P and adding it to U, and recording the updated sets as Ps = P - S and Us = U + S;
training a logistic regression classifier LR_i, i = 0, 1, ..., using Ps as positive samples and Us as negative samples, as follows:
(1) setting a classifier threshold th by using S;
(2) for each sample u ∈ Us: if the result of the classifier LR_i for u is less than the threshold th, adding u to RN, and setting Us = Us - RN;
(3) M = M + LR_i;
training the logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;
using LR_last to classify P, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, and otherwise returning LR_last as the final classifier.
According to a second aspect of the present disclosure, there is provided a user data classification apparatus including: a feature generation unit 701, an annotation unit 702, a sample construction unit 703, a classifier generation unit 704, and a classification unit 705. The feature generation unit 701 is configured to generate features of the user data. The annotation unit 702 is configured to generate annotated and unlabeled datasets of user data according to annotation rules. The sample construction unit 703 is configured to construct a positive sample labeled dataset P and an unknown sample dataset U of one of a plurality of classes from the labeled dataset and the unlabeled dataset. The classifier generating unit 704 is configured to generate a classifier based on the positive sample labeling dataset P and the unknown sample dataset U and the characteristics of the corresponding user data. The classification unit 705 is configured to determine, using the classifier, whether user data in an unlabeled dataset belongs to the one class.
In one embodiment, the user data may be e-commerce user data and the plurality of categories may be a plurality of life stages, such as maternal-infant life stages.
In one embodiment, the annotation unit may be further configured to determine whether the user data satisfies a labeling rule and, if so, add the user data to the labeled data set, where the labeling rule includes: if the user data indicates that only goods of one life stage were purchased, determining the purchase time as the start time of that life stage; if the user data indicates that goods of multiple life stages were purchased in chronological order, determining the start time of the corresponding life stage from the most recently purchased goods; and/or if the user data indicates that goods of multiple life stages were purchased but not in chronological order, determining the start time from the earliest life stage. The annotation unit may be further configured to determine which life stage the user data currently belongs to, according to the determined start time of the life stage, the duration of each life stage and the current time.
In one embodiment, the characteristics may include category characteristics, demographic characteristics, and temporal characteristics of the purchased goods, wherein the temporal characteristics may further include purchase time weighting characteristics and characteristics related to individual life stages.
In one embodiment, the positive sample labeled data set P may comprise user data in the labeled data set belonging to said category, the unknown sample data set U may comprise at least a part of the set consisting of user data in the labeled data set not belonging to said category and user data in the unlabeled data set, and the classifier generation unit may be further configured to perform:
setting the classifier M to be null and the reliable negative sample set RN to be null;
randomly sampling a part S of the user data from P and adding it to U, and recording the updated sets as Ps = P - S and Us = U + S;
training a logistic regression classifier LR_i, i = 0, 1, ..., using Ps as positive samples and Us as negative samples, as follows:
(1) setting a classifier threshold th by using S;
(2) for each sample u ∈ Us: if the result of the classifier LR_i for u is less than the threshold th, adding u to RN, and setting Us = Us - RN;
(3) M = M + LR_i;
training the logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;
using LR_last to classify P, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, and otherwise returning LR_last as the final classifier.
According to a third aspect of the present disclosure, there is provided a server comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the first aspect.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of the first aspect.
The present disclosure provides an improved user data classification method, which trains classifiers using labeled and unlabeled data sets, thereby achieving more accurate classification. More specifically, life stage targeting can be introduced into the precise audience targeting service for e-commerce advertising, and the targeting can be extended to every life stage, including the maternal-infant life stages, so that better personalized recommendation can be provided.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram illustrating a crowd mining basic flow 100 according to an embodiment of the present disclosure;
FIGS. 2A-2D are schematic diagrams illustrating life stages for annotating user data according to embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating a tree structured e-commerce taxonomy in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a method of generating tags for a particular life stage according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating an ABTest tag evaluation design according to an embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating a user data classification method according to an embodiment of the present disclosure;
FIG. 7 is a schematic block diagram illustrating a user data classification apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram illustrating an exemplary system architecture 800 to which the user data classification method or user data classification apparatus of the present disclosure may be applied; and
FIG. 9 is a block diagram illustrating a computer system 900 for implementing embodiments of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The words "a", "an" and "the" and the like as used herein are also intended to include the plural forms unless the context clearly dictates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
In the following, mining or classifying user data will be described by taking the mining of life stages (e.g., maternal-infant life stages) as an example, but those skilled in the art will recognize that the present disclosure may be extended to other classifications as well.
FIG. 1 illustrates a general diagram of a crowd mining basic flow 100 according to an embodiment of the disclosure.
As shown in FIG. 1, the crowd mining basic flow 100 according to this embodiment may include annotating the data at 120. For example, user data can be obtained from the data warehouse 110, the purchasing behavior of e-commerce users analyzed, and reasonable rules defined for automated annotation to produce annotated data sets and unlabeled data sets, as described in detail later.
Additionally, the crowd mining basic flow 100 may include building features at 130. For example, user data may be obtained from data warehouse 110, and features available for training extracted from the user's purchasing behavior.
The labeled and unlabeled data sets obtained in the data labeling operation 120 and the features obtained in the feature construction operation 130 can be fed into a classifier generation model 140 to generate a classifier. For example, the labeled and unlabeled data sets and the constructed training features can be used for positive and unlabeled (PU) learning, and through this learning process a classifier model can be generated to label a large amount of unlabeled data. Specifically, as shown in FIG. 1, a logistic regression (LR) classifier is generated iteratively using a PU learning algorithm. Although FIG. 1 shows two LR classifiers, those skilled in the art will appreciate that this is merely an example representing the iterative generation of classifiers.
Additionally, the effectiveness of the resulting classifier may be evaluated at 150. For example, the classification results of the classifier can be evaluated using a test set or an online A/B test.
The above process steps are described in detail below, taking the mining of the mother and infant population as an example. For example, the data cover consumer purchases on the Jingdong website from January to December 2015. For convenience of description, the mother and infant population may be divided into the following stages, with the label Lx denoting each labeled population, as shown in the following table.
TABLE 1 Maternal and infant population stages and tag values
Tag value    Maternal-infant life stage
L0           Pregnancy
L1           Baby 0-3 months old
L2           Baby 3-6 months old
L3           Baby 6-12 months old
L4           Baby 12-24 months old
Annotating data
First, statistical analysis can be performed on the user data to remove users with abnormal ordering behavior over a period of time. For example, users whose order frequency over the past year is extremely high are treated as suspected of fake-order (order brushing) behavior, and users whose order frequency is extremely low are treated as having no significant consumption characteristics.
Secondly, for the remaining users, it can be determined when each user entered a certain maternal-infant stage, and then which maternal-infant stage the user is probably in at present can be inferred from behavioral characteristics (for example, elapsed time or role confirmation).
In the following, it will be described how it is determined when the user enters a certain maternal-infant stage.
Some goods are only suitable for certain maternal-infant stages; for example, users at L0 are more likely to buy radiation-protective clothing or folic acid. By dividing mother and infant goods into stages, the current maternal-infant stage of the purchasers can be roughly judged. The characteristic goods and goods attributes corresponding to the different stages can be organized into a table similar to the following.
TABLE 2 characteristic commodities corresponding to different maternal and infant stages
Figure BDA0001308728670000081
In order to track the purchasing behavior sequence of the mother and infant commodities of the user all year round, a mother and infant stage behavior statistical table of orders can be established for each user, and the total amount of the orders purchased by each user in a certain mother and infant stage and the first and last purchasing time are recorded. The mother-infant states of some users can be preliminarily judged through the statistical data, and the statistical data can also be used as basic characteristics for subsequent model prediction.
FIG. 2A shows the behavior statistics for the maternal-infant stages, recording information on each user's purchases of goods at each stage, such as the total order amount for goods belonging to Lx (x = 0, 1, 2, 3, 4), the time of the first purchase of goods belonging to Lx, and the time of the last purchase of goods belonging to Lx.
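As an illustration of the per-user statistics described above, the following minimal sketch aggregates order records into a total amount and first and last purchase dates per stage. The record layout `(user_id, stage, order_date, amount)` and the function name `stage_stats` are assumptions made for this example and do not come from the disclosure.

```python
from datetime import date

def stage_stats(orders):
    """Aggregate orders into per-user, per-stage statistics.

    orders: iterable of (user_id, stage, order_date, amount) tuples
            (hypothetical layout chosen for this sketch).
    Returns {user_id: {stage: {"total", "first", "last"}}}.
    """
    stats = {}
    for user, stage, day, amount in orders:
        rec = stats.setdefault(user, {}).setdefault(
            stage, {"total": 0.0, "first": day, "last": day})
        rec["total"] += amount                  # total order amount in this stage
        rec["first"] = min(rec["first"], day)   # earliest purchase in this stage
        rec["last"] = max(rec["last"], day)     # latest purchase in this stage
    return stats

# Illustrative data only.
orders = [
    ("u1", "L4", date(2015, 11, 23), 120.0),
    ("u1", "L4", date(2015, 12, 2), 80.0),
    ("u2", "L1", date(2015, 9, 10), 30.0),
]
stats = stage_stats(orders)
```

The same structure can serve both as input to the labeling rules and as a source of basic features for later model prediction.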
After the behavior statistics for the user's maternal-infant stages have been generated, the maternal-infant stage reflected by the user's purchasing behavior and the time at which the user entered that stage can be determined according to the following labeling rules:
Rule one: the user's orders do not span multiple maternal-infant stages. In this case, the user has only ordered goods belonging to a single maternal-infant stage. For example, as shown in FIG. 2B, the user purchased goods belonging to stage L4 (one or more times) and did not purchase goods of stages L0-L3. The user is then determined to have entered stage L4 at the time of the earliest order. Since the user has only placed orders belonging to stage L4, (L4, 2015-11-23) is output for the user, indicating that the user entered stage L4 on November 23, 2015.
Rule two: the user's orders span multiple maternal-infant stages. This is further subdivided into two cases:
(a) The ordering times of the stages do not cross. That is, the user purchased goods of the maternal-infant stages in chronological order. The life stage corresponding to the most recently purchased goods is then taken as the user's stage, and the earliest order time within that stage is taken as its start time. For example, as shown in FIG. 2C, the goods the user ordered most recently belong to stage L4, so (L4, 2015-12-21) is output for the user, indicating that the user entered the L4 maternal-infant stage on December 21, 2015.
(b) The ordering times of the stages cross. That is, the user purchased goods of multiple life stages, and the corresponding life stages do not follow chronological order. For example, if the user (who may actually be in the pregnancy stage) buys a baby carriage (say, for use at 3-6 months) and then buys a nipple (say, for use at 0-3 months), the user is approximately determined to be in stage L1 (0-3 months), the earliest life stage among the user's purchased goods, and the earliest order time within stage L1 is taken as the start time. For example, as shown in FIG. 2D, the goods the user ordered most recently belong to stage L1, so (L1, 2015-09-10) is output for the user, indicating that the user entered stage L1 on September 10, 2015.
After determining when the user entered a certain maternal-infant stage, the predefined duration of each stage can be used to estimate which maternal-infant stage the user should be in at the current time (for example, December 31, 2015). This portion of the data will be used to build the positive samples for classifier model training (i.e., user data labeled as belonging to a particular maternal-infant stage), as described in detail later.
It should be noted that although it is described that the above annotation rules can be used to deduce which maternal and infant stage the user should be in, the above rules may only cover a part of the user data and generate annotation data. That is, the rule may not cover all user data, and at this time, the user data that the rule does not cover will form unlabeled data, and will need to be classified by the classifier in the future.
Thus, through the data annotation operation 120, labeled and unlabeled data sets may be generated from the user data of the data warehouse 110.
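The labeling rules and the current-stage estimate can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the stage durations in `DURATION_DAYS` (30-day months, 270 days of pregnancy), the function names, and the input layout `{stage: (first_purchase, last_purchase)}` are all hypothetical. Users with no staged purchases return `None` and would remain unlabeled.

```python
from datetime import date

STAGES = ["L0", "L1", "L2", "L3", "L4"]
# Assumed durations in days; the text only says durations are predefined.
DURATION_DAYS = {"L0": 270, "L1": 90, "L2": 90, "L3": 180, "L4": 360}

def label_start(stats):
    """Apply rule one / rule two to {stage: (first_purchase, last_purchase)}.

    Returns (stage, start_date), or None if the user bought no staged goods.
    """
    bought = [s for s in STAGES if s in stats]
    if not bought:
        return None
    if len(bought) == 1:                      # rule one: a single stage
        stage = bought[0]
    else:
        # Rule two: purchases are "not crossed" if each stage's last order
        # precedes the first order of the next purchased stage.
        in_order = all(stats[a][1] <= stats[b][0]
                       for a, b in zip(bought, bought[1:]))
        stage = bought[-1] if in_order else bought[0]   # case (a) / case (b)
    return stage, stats[stage][0]             # earliest order time in that stage

def current_stage(stage, start, today=date(2015, 12, 31)):
    """Advance from the start stage by elapsed time; None once past L4."""
    elapsed = (today - start).days
    idx = STAGES.index(stage)
    while idx < len(STAGES) and elapsed >= DURATION_DAYS[STAGES[idx]]:
        elapsed -= DURATION_DAYS[STAGES[idx]]
        idx += 1
    return STAGES[idx] if idx < len(STAGES) else None
```

For instance, a user labeled (L1, 2015-09-10) has spent 112 days since entering L1 by December 31, 2015, so under the assumed 90-day duration this sketch places the user in L2.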
Building features
Features need to be constructed as input to the classifier generation model 140 before training the model. The features used may include the following groups: category features, user demographic features, and temporal features, described separately below.
Category features
Generally, each e-commerce site displays goods in hierarchical categories. For example, Jingdong uses a three-level category hierarchy to display goods with different attributes, so that users can quickly locate the goods they need. FIG. 3 illustrates a tree structured e-commerce taxonomy in accordance with an embodiment of the present disclosure. The goods in the Jingdong mall include: first-level categories such as household appliances, ..., and books, audio-visual products and e-books; second-level categories under household appliances such as large household appliances, personal care and health, and hardware and home improvement; and third-level categories under large household appliances such as flat-panel televisions, air conditioners, and washing machines.
A user's purchase of goods reflects the user's needs at that time or for some time in the future. For example, mothers at the beginning of pregnancy tend to buy maternity wear, radiation-protective clothing, and the like, while at a later stage they may buy diapers, milk powder, cribs, and the like to prepare for the child's arrival. Which brand of diaper is bought, however, is not of concern. Therefore, describing the user's purchasing behavior at the level of third-level categories captures the user's needs at a fine granularity while grouping goods of the same type into one class.
To reduce the influence of popular categories, each user can be regarded as a document and each category as a word appearing in that document; the TF-IDF (term frequency-inverse document frequency) value of each category is then calculated for the user to construct a category feature vector.
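The user-as-document idea can be sketched as below. The disclosure does not state which TF-IDF variant is used, so this example assumes relative term frequency and idf = log(N / df); the function name and the sample category names are hypothetical.

```python
import math
from collections import Counter

def category_tfidf(purchases):
    """Build TF-IDF category vectors, treating each user as a document.

    purchases: {user_id: [third-level category name, ...]}, one entry per
               purchased item (hypothetical layout for this sketch).
    Returns (vocab, {user_id: [tf-idf weight per vocab entry]}).
    """
    n_users = len(purchases)
    df = Counter()                       # number of users who bought each category
    for cats in purchases.values():
        df.update(set(cats))
    vocab = sorted(df)
    vectors = {}
    for user, cats in purchases.items():
        tf = Counter(cats)
        total = len(cats)
        vectors[user] = [
            (tf[c] / total) * math.log(n_users / df[c]) if total else 0.0
            for c in vocab
        ]
    return vocab, vectors

# Illustrative data: "phone case" is bought by everyone, so its weight is zero.
purchases = {
    "u1": ["diapers", "diapers", "phone case"],
    "u2": ["milk powder", "phone case"],
    "u3": ["phone case"],
}
vocab, vectors = category_tfidf(purchases)
```

A category bought by every user gets idf = log(1) = 0, which is how the weighting suppresses popular categories.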
User demographic features
Typically, the consumption behavior of a user is related to the user's demographic characteristics. For example, differences in age, sex, and e-commerce membership grade (which often reflects spending power) are reflected in differences in consumption habits. The present disclosure uses the registered user information of an e-commerce website and users' shopping behavior to extract features along multiple user dimensions, referred to as a "user portrait". Examples of user demographic features are shown in the following table.
TABLE 3 user demographic Properties
Figure BDA0001308728670000111
Temporal features
The temporal features may include, for example, temporal features related to the various life stages (e.g., maternal-infant stages) and time-weighted features.
Temporal features related to the various life stages. For example, whether a user purchased maternity wear a year ago or a month ago makes a large difference when inferring which maternal-infant stage the user is currently in, the latter user being more likely to belong to stage L0. Meanwhile, if the user has purchased goods belonging to a certain maternal-infant stage (e.g., L0) multiple times, it can be roughly inferred how long the user has been in this stage: a user who has been through 9 months of pregnancy is more likely to purchase goods of the next stage (L1) than a user who has been through 2 months. To this end, the present disclosure presents exemplary maternal and infant goods purchase features as shown in the following table.
TABLE 4 time characteristics of the user's purchase of various maternal and infant stage commodities
Figure BDA0001308728670000121
Time-weighted features. Similarly, a user who purchased goods a year ago and a user who purchased a month ago may differ greatly in current activity, and the latter is more likely to purchase again within a short period. The time-weighted feature is defined by the following formula:
Figure BDA0001308728670000122
where λ is an attenuation factor, which may be taken as 5.0/365 in the present disclosure, T is the timestamp of December 31, 2015, t_i is the date timestamp of the user's i-th order, and m is the total number of the user's orders.
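The formula itself is only available as an image in this text, so its exact form cannot be confirmed here. One common form that is consistent with the listed symbols is an exponentially decayed order count, f = Σ_{i=1..m} exp(-λ (T - t_i)), with the time difference measured in days. The sketch below implements that assumed form; the function name and the day unit are likewise assumptions.

```python
import math
from datetime import date

LAMBDA = 5.0 / 365          # attenuation factor given in the text
T_REF = date(2015, 12, 31)  # reference date T given in the text

def time_weighted_feature(order_dates, ref=T_REF, lam=LAMBDA):
    """Assumed form: sum over orders of exp(-lam * days between order and ref)."""
    return sum(math.exp(-lam * (ref - d).days) for d in order_dates)

# An order placed on the reference date contributes 1.0; an order placed
# 365 days earlier contributes exp(-5), i.e. less than 0.01.
recent = time_weighted_feature([date(2015, 12, 31)])
old = time_weighted_feature([date(2014, 12, 31)])
```

Under this assumed form, recent orders dominate the feature, which matches the stated intent that a purchase a month ago should count for more than one a year ago.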
Finally, during training, the various features are normalized and arranged into a multi-dimensional feature matrix in which the feature vector of each user corresponds to one row, as follows:
Figure BDA0001308728670000131
Classifier model generation
In the present disclosure, classifier model generation may include positive and unlabeled sample learning. The present disclosure employs a semi-supervised learning method (PU-Learning) to achieve population expansion. As described above, the labeling rules yield only a small amount of positive labeled data, and a reliable negative sample set cannot be labeled, so a reliable classification model cannot be trained directly. The present disclosure addresses the classification problem of having only a small number of positive examples and a large number of unknown samples by applying PU learning. Specifically, the present disclosure proposes an algorithm, which may be referred to as a "spy technique", that adds positive examples to the unlabeled samples, at a sampling rate defined as a proportion of the total positive example data, and trains a model to obtain reliable negative examples.
The basic idea of the algorithm is as follows:
Because no reliable negative samples exist at the start, the initial reliable negative sample set RN is empty. A portion S of the data is randomly drawn from the positive samples P and added to the unknown samples U, giving Ps and Us. Ps is labeled 1 and Us is labeled 0, and an initial logistic regression classifier is trained. A threshold is set using the data set S and the whole of U is classified; the data W that the classifier labels 0 is added to RN. The classifier is then retrained using Ps and RN, the samples newly classified as negative are added to RN, and the iteration repeats until the termination condition is met. In short, while preserving the classification accuracy on positive examples, the positive and unlabeled learning algorithm expands the reliable negative sample set in each iteration.
In general, positive and unlabeled learning methods are better suited to binary classification problems, whereas the division of maternal-infant life stages is a multi-class problem; the one-vs-rest strategy is therefore adopted to convert the multi-class problem into several binary classification problems.
The complete spy-technique-based positive and unlabeled learning algorithm flow can be summarized as follows:
Algorithm: positive and unlabeled learning procedure
Figure BDA0001308728670000141
Figure BDA0001308728670000151
The positive and unlabeled learning algorithm above has several parameters to be set, such as the sampling rate s% and the threshold th. So that the positive training samples do not become too few while S remains large enough to achieve the "spy" effect, the present disclosure may use a sampling rate of, for example, 15%. Ideally, the threshold th set for the model generated in each iteration would allow the entire data set S to be correctly classified as positive; because of noise in the data, however, it is sufficient to set th so that the model classifies between 80% and 100% of S correctly, for example so that the classification accuracy on S is 95%.
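One possible way to choose th from the spy set is sketched below: th is taken as the largest score such that at least the target fraction of S scores at or above it. The function name `spy_threshold` and the parameter `target_recall` are hypothetical, and the quantile rule is an assumption; the disclosure only states the accuracy that th should guarantee on S.

```python
import math


def spy_threshold(spy_scores, target_recall=0.95):
    """Return the largest th such that at least target_recall of S scores >= th."""
    if not spy_scores:
        raise ValueError("spy set S must not be empty")
    ordered = sorted(spy_scores, reverse=True)
    keep = math.ceil(target_recall * len(ordered))  # spies that must stay positive
    return ordered[max(keep, 1) - 1]


# Classifier scores of four spies (placeholder values); keep at least 75% positive.
th = spy_threshold([0.9, 0.8, 0.7, 0.2], target_recall=0.75)
```

Because a sample is treated as negative only when its score is below th, every spy whose score equals th is still counted as positive.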
Fig. 4 illustrates a flow chart of a method 400 of generating a pregnancy (L0) stage tag according to an embodiment of the present disclosure. Those skilled in the art will appreciate that the same methods are applicable to the generation of stage L1 through L4 tags.
The method 400 starts at step 401. Then, in step 402, it is determined whether the user data satisfies the automated labeling rules. If so, positive examples of the user data, that is, the labeled data set, are obtained in step 403. At this point, the labeled data set may include user data of each of the stages L0 through L4. If the result of step 402 is no, the corresponding user data constitutes unlabeled data in step 404, that is, user data that the automated labeling rules do not cover.
Next, in step 405, it is determined whether the positive-example user data obtained in step 403 belongs to stage L0, because binary classification will be performed using the one-vs-rest method. If so, the corresponding user data is labeled 1, and the training data set P, the validation data set, and the test set may be randomly generated in an 8: 1 ratio. Those skilled in the art will appreciate that the training data set, the validation data set, and the test set may be generated in other proportions in order to produce a more accurate and reliable classifier, and that the proportions are not limited to the above. The unlabeled data generated in step 404 and the user data determined in step 405 not to belong to stage L0 may be combined, and an unlabeled data set U may be drawn from the combination in proportion. For example, data may be randomly sampled from the combined data so that the ratio of the training data set P to the unlabeled data set U is 1:10. That is, the ratio of positive samples to unlabeled samples is 1:10.
Furthermore, in addition to the training data set P and the unlabeled data set U, the features of the user data are required as input to the positive and unlabeled learning algorithm. Thus, at step 408, the features of the user data may be extracted. Based on the above, in step 409 a classifier is generated by the positive and unlabeled learning algorithm; for the specific process, refer to the flow described above.
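The construction of P and U for one stage can be sketched as follows. This is a simplified illustration: it omits the split into training, validation, and test sets, and the function name `build_pu_sets`, the dictionary layout of the labeled data, and the fixed random seed are assumptions.

```python
import random


def build_pu_sets(labeled, unlabeled, target_stage, ratio=10, seed=0):
    """labeled maps user id -> stage label; unlabeled is an iterable of user ids."""
    rng = random.Random(seed)
    positives = sorted(u for u, s in labeled.items() if s == target_stage)
    # Users labeled with another stage join the unlabeled pool (one-vs-rest).
    pool = sorted(u for u, s in labeled.items() if s != target_stage)
    pool += sorted(unlabeled)
    k = min(len(pool), ratio * len(positives))  # 1:10 ratio of P to U by default
    return positives, rng.sample(pool, k)


labeled = {"a": "L0", "b": "L1", "c": "L0"}
unlabeled = ["u%d" % i for i in range(50)]
P, U = build_pu_sets(labeled, unlabeled, "L0")
```

Repeating the call with "L1" through "L4" as the target stage would yield one binary problem per life stage.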
Then, at step 410, the unlabeled data is classified using the generated classifier, and if the output at step 411 is 1, the corresponding user data is labeled with L0 at step 412. At step 413, the method 400 ends.
By repeating the similar method, all user data can be classified into various life stages.
Evaluation of Effect
In addition to offline cross-validation of the classification model, the present disclosure also designs an online ABTest validation mechanism, which validates label quality and business metrics online and validates the reliability of the mining results through prediction of e-commerce consumer populations.
Fig. 5 shows an ABTest label evaluation design. Viewed from the traffic side, the traffic is divided into three sets according to whether an exposure hits a labeled user. Set A: the exposures participating in the experiment. Set B: the exposures in A for which the requesting user carries the label L to be verified. Set C: the exposures in B that are triggered by targeting on label L. Taking verification of the maternal-infant population labels as an example, the following comparative experiments were designed to measure the value of the L0 targeting label:
exp-base: pv (page view) sampling; baseline experiment, in which the L0 label data is used correctly;
exp-random: pv sampling; random experiment, in which the L0 targeting label data is randomized (implemented by randomly selecting users u1 and u2 and exchanging their L0 targeting label data, with everything else unchanged);
exp-use: pv sampling; the L0 targeting label is not used, that is, the L0 label data on the user is removed manually.
Using the ABTest system, exp-base is compared with exp-random and with exp-use on set B, and exp-base is compared with exp-random on set C, observing advertising business metrics such as CPM (Cost Per Mille, the cost per thousand impressions) and CTR (Click-Through Rate).
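For illustration, the two metrics and the comparison between experiment arms can be computed as in the following sketch. The helper names and the relative-lift comparison are assumptions; the disclosure does not specify how the arms are compared numerically, and the counts below are made-up placeholders rather than experimental results.

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions if impressions else 0.0


def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return 1000.0 * cost / impressions if impressions else 0.0


def relative_lift(treatment, baseline):
    """Relative change of a metric in one arm against another arm."""
    return (treatment - baseline) / baseline


# Placeholder counts for two arms on set B (not real measurements).
base_ctr = ctr(clicks=30, impressions=1000)
random_ctr = ctr(clicks=20, impressions=1000)
lift = relative_lift(base_ctr, random_ctr)
```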
Fig. 6 illustrates a user data classification method 600 according to an embodiment of the disclosure, the method 600 including: in step 601, generating characteristics of user data; in step 602, generating a labeled data set and an unlabeled data set of the user data according to the labeling rule; in step 603, constructing a positive sample labeled dataset P and an unknown sample dataset U of one of a plurality (e.g., more than 2) of classes from the labeled dataset and the unlabeled dataset; in step 604, a classifier is generated according to the positive sample label data set P, the unknown sample data set U and the characteristics of the corresponding user data; and in step 605, determining whether the user data in the unlabeled dataset belongs to the one class using the classifier.
In one embodiment, the user data may be e-commerce user data, and the plurality of categories are a plurality of life stages.
In one embodiment, the method 600 may further include determining whether the user data satisfies a labeling rule and, if so, adding the user data to the labeled data set. The labeling rule may include: if the user data indicates that commodities of only one life stage have been purchased, determining the purchase time as the start time of that life stage; if the user data indicates that commodities of multiple life stages have been purchased in chronological order, determining the start time of the corresponding life stage from the time of the last purchase; and/or if the user data indicates that commodities of multiple life stages have been purchased but not in chronological order, determining the earliest order time belonging to a life stage as the start time of that life stage. The method may further comprise determining to which life stage the user data currently belongs according to the determined start time of the life stage, the duration of each life stage, and the current time.
In one embodiment, the characteristics may include category characteristics, demographic characteristics, and temporal characteristics of the purchased goods, wherein the temporal characteristics may include purchase time weighted characteristics and characteristics related to individual life stages.
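The final step, deriving the current life stage from a determined start time, can be sketched as follows. The stage durations in `STAGE_DAYS` are hypothetical placeholders, not values taken from the disclosure, and the function name `current_stage` is likewise an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical durations in days for stages L0 to L4 (placeholders only).
STAGE_DAYS = [("L0", 280), ("L1", 180), ("L2", 365), ("L3", 730), ("L4", 1095)]


def current_stage(start_stage, start_time, now, stage_days=STAGE_DAYS):
    """Walk forward from the start of start_stage until the stage containing now."""
    names = [name for name, _ in stage_days]
    boundary = start_time
    for name, days in stage_days[names.index(start_stage):]:
        boundary += timedelta(days=days)  # end of this stage
        if now < boundary:
            return name
    return None  # the user has passed the final stage


stage_mid = current_stage("L0", datetime(2015, 1, 1), datetime(2015, 6, 1))
stage_late = current_stage("L0", datetime(2015, 1, 1), datetime(2015, 12, 31))
```

With the placeholder durations, a user whose L0 stage started on January 1, 2015 is still in L0 on June 1, 2015 and has moved to L1 by December 31, 2015.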
In one embodiment, the positive sample labeled dataset P may comprise the user data in the labeled dataset that belongs to the category, the unknown sample dataset U may comprise at least a portion of the set consisting of the user data in the labeled dataset that does not belong to the category and the user data in the unlabeled dataset, and generating the classifier may comprise the steps of:
setting the classifier M to null and the reliable negative sample set RN to empty;
randomly sampling a portion S of the user data from P and adding it to U, updating P and U, denoted Ps = P - S and Us = U + S;
training a logistic regression classifier LR_i, i = 0, 1, ..., using Ps as positive samples and Us as negative samples, as follows:
(1) setting a classifier threshold th using S;
(2) for each sample u ∈ Us: if the output of the classifier LR_i for u is less than the threshold th, adding u to RN, and setting Us = Us - RN;
(3) M = M + LR_i;
training the logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;
classifying P using LR_last, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, and otherwise returning LR_last as the final classifier.
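The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the gradient-descent logistic regression, the termination rule (no new reliable negatives or `max_iter` rounds), the 0.5 decision threshold in the final check, the `max_missed` fraction, and all function names are assumptions, and P is assumed to contain at least two samples.

```python
import math
import random


def predict(w, x):
    """Logistic score in (0, 1); the last weight is the bias."""
    z = sum(wj * v for wj, v in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_lr(X, y, epochs=200, step=0.5):
    """Fit logistic regression with plain batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            err = predict(w, x) - label
            for j, v in enumerate(x):
                grad[j] += err * v
            grad[-1] += err
        w = [wj - step * g / len(X) for wj, g in zip(w, grad)]
    return w


def spy_threshold(spy_scores, keep=0.95):
    """Largest th such that at least `keep` of the spies score >= th."""
    ordered = sorted(spy_scores, reverse=True)
    return ordered[math.ceil(keep * len(ordered)) - 1]


def spy_pu_learning(P, U, spy_rate=0.15, keep=0.95, max_iter=5,
                    max_missed=0.05, seed=0):
    rng = random.Random(seed)
    n_spy = max(1, round(spy_rate * len(P)))
    spy_idx = set(rng.sample(range(len(P)), n_spy))
    S = [p for i, p in enumerate(P) if i in spy_idx]
    Ps = [p for i, p in enumerate(P) if i not in spy_idx]  # Ps = P - S
    Us = list(U) + S                                       # Us = U + S
    RN, models = [], []
    negatives = Us  # LR_0 treats every sample in Us as negative
    for _ in range(max_iter):
        model = train_lr(Ps + negatives, [1] * len(Ps) + [0] * len(negatives))
        models.append(model)                               # M = M + LR_i
        th = spy_threshold([predict(model, s) for s in S], keep)
        scored = [(u, predict(model, u)) for u in Us]
        newly = [u for u, score in scored if score < th]
        if not newly:  # assumed termination: no new reliable negatives
            break
        RN.extend(newly)
        Us = [u for u, score in scored if score >= th]     # Us = Us - RN
        negatives = RN  # later classifiers use RN as the negative set
    last = models[-1]
    missed = sum(predict(last, p) < 0.5 for p in P)
    if missed > max_missed * len(P) and len(models) > 1:
        return models[1]  # fall back to LR_1
    return last


# Synthetic two-dimensional data: positives near (2, 2), negatives near (-2, -2).
data_rng = random.Random(1)


def cloud(center, n):
    return [[data_rng.gauss(center, 0.5), data_rng.gauss(center, 0.5)]
            for _ in range(n)]


P = cloud(2.0, 40)
U = cloud(2.0, 30) + cloud(-2.0, 170)  # U hides 30 unlabeled positives
model = spy_pu_learning(P, U)
```

In this sketch the spies remain in Us throughout, so a small fraction of them may end up in RN; this mirrors the tolerance for noise that the threshold th is meant to provide.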
Fig. 7 shows a user data classification apparatus 700 according to an embodiment of the present disclosure. The user data classification apparatus 700 includes: a feature generation unit 701, an annotation unit 702, a sample construction unit 703, a classifier generation unit 704, and a classification unit 705. The feature generation unit 701 is configured to generate features of the user data. The annotation unit 702 is configured to generate annotated and unlabeled datasets of user data according to annotation rules. The sample construction unit 703 is configured to construct a positive sample labeled dataset P and an unknown sample dataset U of one of a plurality of classes from the labeled dataset and the unlabeled dataset. The classifier generating unit 704 is configured to generate a classifier based on the positive sample labeling dataset P and the unknown sample dataset U and the characteristics of the corresponding user data. The classification unit 705 is configured to determine, using the classifier, whether user data in an unlabeled dataset belongs to the one class.
In one embodiment, the user data may be e-commerce user data and the plurality of categories may be a plurality of life stages.
In one embodiment, the annotation unit may be further configured to determine whether the user data satisfies an annotation rule and, if so, to add the user data to the annotated data set. The annotation rule includes: if the user data indicates that commodities of only one life stage have been purchased, determining the purchase time as the start time of that life stage; if the user data indicates that commodities of multiple life stages have been purchased in chronological order, determining the start time of the corresponding life stage from the time of the last purchase; and/or if the user data indicates that commodities of multiple life stages have been purchased but not in chronological order, determining the earliest order time belonging to a life stage as the start time of that life stage. The annotation unit may be further configured to determine to which life stage the user data currently belongs according to the determined start time of the life stage, the duration of each life stage, and the current time.
In one embodiment, the characteristics may include category characteristics, demographic characteristics, and temporal characteristics of the purchased goods, wherein the temporal characteristics may further include purchase time weighting characteristics and characteristics related to individual life stages.
In one embodiment, the positive sample labeled dataset P may comprise the user data in the annotated dataset that belongs to the category, the unknown sample dataset U may comprise at least a portion of the set consisting of the user data in the annotated dataset that does not belong to the category and the user data in the unlabeled dataset, and the classifier generation unit may be further configured to perform:
setting the classifier M to null and the reliable negative sample set RN to empty;
randomly sampling a portion S of the user data from P and adding it to U, updating P and U, denoted Ps = P - S and Us = U + S;
training a logistic regression classifier LR_i, i = 0, 1, ..., using Ps as positive samples and Us as negative samples, as follows:
(1) setting a classifier threshold th using S;
(2) for each sample u ∈ Us: if the output of the classifier LR_i for u is less than the threshold th, adding u to RN, and setting Us = Us - RN;
(3) M = M + LR_i;
training the logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;
classifying P using LR_last, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, and otherwise returning LR_last as the final classifier.
Fig. 8 shows an exemplary system architecture 800 to which the user data classification method or user data classification apparatus of the present disclosure may be applied.
As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves to provide a medium for communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various types of connections, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server that provides various services, such as a back-end management server (by way of example only) that supports shopping websites browsed by users using the terminal devices 801, 802, 803. The back-end management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, targeted push information or product information, by way of example only) to the terminal device.
It should be noted that the user data classification method provided in the embodiment of the present application may be generally executed by the server 805, and accordingly, the user data classification apparatus may be generally disposed in the server 805.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use in implementing embodiments of the present disclosure. The computer system illustrated in FIG. 9 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 910 as necessary, so that a computer program read from it is installed into the storage section 908 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present disclosure are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims (10)

1. A user data classification method, comprising:
generating characteristics of the user data, the characteristics including category characteristics, demographic characteristics, and temporal characteristics of the purchased goods, wherein the temporal characteristics include purchase time weighted characteristics and characteristics related to individual life stages, and wherein the purchase time weighted characteristics are defined as follows:
Figure FDA0002593969570000011
where λ is the attenuation factor, T is the timestamp, t_i is the timestamp of the i-th purchasing behavior of the user, and m is the total number of purchases by the user up to T;
generating a labeled data set and an unlabeled data set of the user data according to the labeling rule;
constructing a positive sample labeling data set P and an unknown sample data set U of one of a plurality of categories according to the labeling data set and the unlabeled data set;
generating a classifier according to the positive sample labeling data set P, the unknown sample data set U and the characteristics of corresponding user data;
determining, using the classifier, whether user data in an unlabeled dataset belongs to the class.
2. The method of claim 1, wherein the user data is e-commerce user data and the plurality of categories are a plurality of life stages.
3. The method of claim 2, further comprising determining whether the user data satisfies a labeling rule, and if so, adding to a labeling dataset, the labeling rule comprising:
if the user data indicates that only goods of one life stage have been purchased, the purchase time is determined as the start time of the life stage,
if the user data indicates that a plurality of life stage commodities are purchased and purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage, and/or
If the user data indicates that commodities in a plurality of life stages are purchased and the commodities are not purchased according to the time sequence, determining the earliest ordering time belonging to the life stage as the start time of the life stage;
and determining which life stage the user data currently belongs to according to the determined start time of the life stages, the duration of each life stage and the current time.
4. The method according to claim 1, wherein the positive sample labeling data set P comprises user data in the labeled data set belonging to said category, the unknown sample data set U comprises at least a part of the set consisting of user data in the labeled data set not belonging to said category and user data in the unlabeled data set, and generating the classifier comprises the steps of:
setting the classifier M to null and the reliable negative sample set RN to empty;
randomly sampling a portion S of the user data from P and adding it to U, updating P and U, denoted Ps = P - S and Us = U + S;
training a logistic regression classifier LR_i, i = 0, 1, …, using Ps as positive samples and Us as negative samples, as follows:
(1) setting a classifier threshold th using S;
(2) for each sample u ∈ Us: if the output of the classifier LR_i for u is less than the threshold th, adding u to RN, and setting Us = Us - RN;
(3) M = M + LR_i;
training the logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;
classifying P using LR_last, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, and otherwise returning LR_last as the final classifier.
5. A user data classification apparatus comprising:
a feature generation unit configured to generate features of the user data, the features including category features, demographic attribute features, and temporal features of the purchased goods, wherein the temporal features include purchase time weighting features and features related to respective life stages, and wherein the purchase time weighting features are defined as follows:
Figure FDA0002593969570000021
where λ is the attenuation factor, T is the timestamp, t_i is the timestamp of the i-th purchasing behavior of the user, and m is the total number of purchases by the user up to T;
the annotation unit is configured to generate an annotated data set and an unlabeled data set of the user data according to the annotation rule;
the sample construction unit is configured to construct a positive sample labeling data set P and an unknown sample data set U of one of a plurality of categories according to the labeling data set and the unlabeled data set;
the classifier generating unit is configured to generate a classifier according to the positive sample labeling data set P, the unknown sample data set U and the characteristics of the corresponding user data;
a classification unit configured to determine whether user data in an unlabeled dataset belongs to the class using the classifier.
6. The apparatus of claim 5, wherein the user data is e-commerce user data and the plurality of categories are a plurality of life stages.
7. The apparatus of claim 6, wherein the annotation unit is further configured to determine whether the user data satisfies an annotation rule, and if so, to add to an annotated data set, the annotation rule comprising:
if the user data indicates that only goods of one life stage have been purchased, the purchase time is determined as the start time of the life stage,
if the user data indicates that a plurality of life stage commodities are purchased and purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage, and/or
If the user data indicates that commodities in a plurality of life stages are purchased and the commodities are not purchased according to the time sequence, determining the earliest ordering time belonging to the life stage as the start time of the life stage;
wherein the annotation unit is further configured to determine to which life stage the user data currently belongs, depending on the determined start time of the life stage, the duration of each life stage and the current time.
8. The apparatus according to claim 5, wherein the positive sample labeling data set P comprises user data in the labeled data set belonging to said category, the unknown sample data set U comprises at least a part of the set consisting of user data in the labeled data set not belonging to said category and user data in the unlabeled data set, and the classifier generation unit is further configured to perform:
setting the classifier M to null and the reliable negative sample set RN to empty;
randomly sampling a portion S of the user data from P and adding it to U, updating P and U, denoted Ps = P - S and Us = U + S;
training a logistic regression classifier LR_i, i = 0, 1, …, using Ps as positive samples and Us as negative samples, as follows:
(1) setting a classifier threshold th using S;
(2) for each sample u ∈ Us: if the output of the classifier LR_i for u is less than the threshold th, adding u to RN, and setting Us = Us - RN;
(3) M = M + LR_i;
training the logistic regression classifier LR_i using Ps as positive samples and RN as negative samples, and repeating steps (1) to (3) until an iteration termination condition is met, to obtain a classifier LR_last;
classifying P using LR_last, and returning LR_1 as the final classifier if more than a threshold number of positive samples are determined to be negative, and otherwise returning LR_last as the final classifier.
9. A server, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1 to 4.
CN201710401985.2A 2017-05-31 2017-05-31 User data classification method, device, server and computer readable storage medium Active CN107273454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710401985.2A CN107273454B (en) 2017-05-31 2017-05-31 User data classification method, device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710401985.2A CN107273454B (en) 2017-05-31 2017-05-31 User data classification method, device, server and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN107273454A CN107273454A (en) 2017-10-20
CN107273454B true CN107273454B (en) 2020-11-03

Family

ID=60065763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710401985.2A Active CN107273454B (en) 2017-05-31 2017-05-31 User data classification method, device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN107273454B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801091B (en) * 2017-11-16 2022-12-20 腾讯科技(深圳)有限公司 Target user group positioning method and device, computer equipment and storage medium
CN109840788B (en) * 2017-11-27 2021-11-02 北京京东尚科信息技术有限公司 Method and device for analyzing user behavior data
CN109919790A (en) * 2017-12-13 2019-06-21 腾讯科技(深圳)有限公司 Group type recognition methods, device, electronic equipment and storage medium
US11599753B2 (en) * 2017-12-18 2023-03-07 Oracle International Corporation Dynamic feature selection for model generation
CN109961308B (en) * 2017-12-25 2021-05-25 北京京东尚科信息技术有限公司 Method and apparatus for evaluating tag data
CN108256907A (en) * 2018-01-09 2018-07-06 北京腾云天下科技有限公司 A kind of construction method and computing device of customer grouping model
CN108364192B (en) * 2018-01-16 2022-10-18 创新先进技术有限公司 User mining method and device and electronic equipment
CN108305099B (en) * 2018-01-18 2021-11-19 创新先进技术有限公司 Method and device for determining purchasing user
CN108399418B (en) * 2018-01-23 2021-09-03 北京奇艺世纪科技有限公司 User classification method and device
CN110706049A (en) * 2018-07-10 2020-01-17 北京京东尚科信息技术有限公司 Data processing method and device
CN109191167A (en) * 2018-07-17 2019-01-11 阿里巴巴集团控股有限公司 A kind of method for digging and device of target user
CN110428295A (en) * 2018-08-01 2019-11-08 北京京东尚科信息技术有限公司 Method of Commodity Recommendation and system
CN110826579A (en) * 2018-08-07 2020-02-21 北京京东尚科信息技术有限公司 Commodity classification method and device
CN109087145A (en) * 2018-08-13 2018-12-25 阿里巴巴集团控股有限公司 Target group's method for digging, device, server and readable storage medium storing program for executing
CN109325525A (en) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 Sample attribute assessment models training method, device and server
CN111225009B (en) * 2018-11-27 2023-06-27 北京沃东天骏信息技术有限公司 Method and device for generating information
CN111340053A (en) * 2018-12-03 2020-06-26 北京嘀嘀无限科技发展有限公司 Order classification method, classification system, computer device and readable storage medium
CN111325228B (en) * 2018-12-17 2021-04-06 上海游昆信息技术有限公司 Model training method and device
CN109948730A (en) * 2019-03-29 2019-06-28 中诚信征信有限公司 A kind of data classification method, device, electronic equipment and storage medium
CN110322281B (en) * 2019-06-06 2023-10-27 创新先进技术有限公司 Similar user mining method and device
CN110458641B (en) * 2019-06-28 2022-02-25 苏宁云计算有限公司 E-commerce recommendation method and system
CN110597984B (en) * 2019-08-12 2022-05-20 大箴(杭州)科技有限公司 Method and device for determining abnormal behavior user information, storage medium and terminal
CN110796482A (en) * 2019-09-27 2020-02-14 北京淇瑀信息科技有限公司 Financial data classification method and device for machine learning model and electronic equipment
CN110796171A (en) * 2019-09-27 2020-02-14 北京淇瑀信息科技有限公司 Unclassified sample processing method and device of machine learning model and electronic equipment
CN112580681B (en) * 2019-09-30 2022-02-01 北京星选科技有限公司 User classification method and device, electronic equipment and readable storage medium
CN110807547A (en) * 2019-10-22 2020-02-18 恒大智慧科技有限公司 Method and system for predicting family population structure
CN110807546A (en) * 2019-10-22 2020-02-18 恒大智慧科技有限公司 Community grid population change early warning method and system
CN111401962A (en) * 2020-03-20 2020-07-10 上海络昕信息科技有限公司 Key opinion consumer mining method, device, equipment and medium
CN111612519B (en) * 2020-04-13 2023-11-21 广发证券股份有限公司 Method, device and storage medium for identifying potential customers of financial products
CN112800109A (en) * 2021-01-21 2021-05-14 蜜兔(杭州)网络科技有限公司 Information mining method and system
CN113313561A (en) * 2021-07-29 2021-08-27 全屋优品科技(深圳)有限公司 Transaction management method and system for home soft package supply chain
CN114268559B (en) * 2021-12-27 2024-02-20 天翼物联科技有限公司 Directional network detection method, device, equipment and medium based on TF-IDF algorithm

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127525A (en) * 2016-06-27 2016-11-16 浙江大学 A kind of TV shopping Method of Commodity Recommendation based on sorting algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4465417B2 (en) * 2006-12-14 2010-05-19 インターナショナル・ビジネス・マシーンズ・コーポレーション Customer segment estimation device
CN104090888B (en) * 2013-12-10 2016-05-11 深圳市腾讯计算机系统有限公司 A kind of analytical method of user behavior data and device
US20160267168A1 (en) * 2013-12-19 2016-09-15 Hewlett Packard Enterprise Development Lp Residual data identification
CN106202177B (en) * 2016-06-27 2017-12-15 腾讯科技(深圳)有限公司 A kind of file classification method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127525A (en) * 2016-06-27 2016-11-16 浙江大学 A kind of TV shopping Method of Commodity Recommendation based on sorting algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
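Improved IBCF algorithm distinguishing users' long-term and short-term interests; Sun Jingyu et al.; Journal of Zhengzhou University (Natural Science Edition); 20100630; Vol. 42 (No. 2); pp. 35-38 *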

Also Published As

Publication number Publication date
CN107273454A (en) 2017-10-20

Similar Documents

Publication Publication Date Title
CN107273454B (en) User data classification method, device, server and computer readable storage medium
US20190303709A1 (en) Feature information extraction method, apparatus, server cluster, and storage medium
US10235346B2 (en) Method and apparatus for inbound message summarization using message clustering and message placeholders
CN110363604B (en) Page generation method and device
CN102385727A (en) ID-value assessment device, ID-value assessment system, and ID-value assessment method
CN111400613A (en) Article recommendation method, device, medium and computer equipment
CN116862592B (en) Automatic push method for SOP private marketing information based on user behavior
CN111104590A (en) Information recommendation method, device, medium and electronic equipment
CN110717597A (en) Method and device for acquiring time sequence characteristics by using machine learning model
CN111429214B (en) Transaction data-based buyer and seller matching method and device
US20130339132A1 (en) Single-Source Data Analysis of Advertising and Promotion Effects
CN111429161A (en) Feature extraction method, feature extraction device, storage medium, and electronic apparatus
CN113298568B (en) Method and device for advertising
CN111225009B (en) Method and device for generating information
CN109840788A (en) For analyzing the method and device of user behavior data
CN112801685A (en) Information pushing method and device, computer equipment and storage medium
US10152754B2 (en) System and method for small business owner identification
CN114391159A (en) Digital anthropology and anthropology system
KR102404247B1 (en) Customer management system
CN113516496B (en) Advertisement conversion rate estimation model construction method, device, equipment and medium thereof
CN111460300B (en) Network content pushing method, device and storage medium
CN112801710A (en) Accurate advertisement recommendation system based on face big data
CN113779276A (en) Method and device for detecting comments
Chishti et al. Identify Website Personality by Using Unsupervised Learning Based on Quantitative Website Elements
CN110727797A (en) Label generation method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant