CN107273454A

CN107273454A - User data classification method, apparatus, server and computer-readable storage medium

Info

Publication number: CN107273454A
Application number: CN201710401985.2A
Authority: CN
Inventors: 赫南; 朱顺; 孙振鹏; 杨旭; 陈英杰; 完灏; 胡景贺; 温园旭; 李慧倩; 李婵怡
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-05-31
Filing date: 2017-05-31
Publication date: 2017-10-20
Anticipated expiration: 2037-05-31
Also published as: CN107273454B

Abstract

The present disclosure provides a user data classification method, including: generating features of user data; generating a labeled data set and an unlabeled data set of the user data according to labeling rules; constructing, from the labeled data set and the unlabeled data set, a positive-sample labeled data set P and an unknown-sample data set U for one category among multiple categories; generating a classifier from the positive-sample labeled data set P, the unknown-sample data set U and the corresponding user data features; and using the classifier to determine whether user data in the unlabeled data set belongs to that category. The disclosure classifies user data with an improved positive and unlabeled (PU) learning algorithm. It is suitable for feature extraction and mining systems that identify groups of users in similar life stages, and can thereby support precisely targeted e-commerce advertising.

Description

User data classification method, apparatus, server and computer-readable storage medium

Technical field

The present disclosure relates to the field of Internet technology, and in particular to a user data classification method, apparatus, server and computer-readable storage medium.

Background art

In recent years, market researchers and sociologists have increasingly recognized that consumers in different categories, for example at different life stages, show different shopping behavior. Consumers can be divided into coarse-grained life stages, for example: student (young and unmarried), newlywed (young and without children), middle-aged (married, with zero or more children), and elderly (older or retired, with children living independently). People at different life stages (age groups) clearly show different consumption tendencies. For example, pregnant women buy folic acid and vitamins, and mothers buy products that match their baby's age, such as milk powder, strollers, safety seats and educational toys. On the mother-and-baby channels of e-commerce websites and in vertical apps, these purchasing patterns are quite pronounced. Introducing life-stage targeting into the audience-targeting business and the recommendation system for e-commerce advertising can therefore yield better recommendation results.

However, in the course of realizing the present invention, the inventors found that the prior art has at least the following technical problems. The effectiveness of existing methods depends heavily on the correctness and scale of the training data. In addition, some products, such as mother-and-baby products, have standardized attributes (milk powder, for example, clearly states the suitable age range) and therefore already target a specific audience strongly, so they may not be suitable as a recommendation application. A method and apparatus are therefore needed that classify users better, for example by more accurately and reliably mining consumers who share the same life stage in an e-commerce system, so as to serve precise audience targeting for e-commerce advertising.

Summary of the invention

According to a first aspect of the present disclosure, a user data classification method is provided. The method includes: generating features of user data; generating a labeled data set and an unlabeled data set of the user data according to labeling rules; constructing, from the labeled data set and the unlabeled data set, a positive-sample labeled data set P and an unknown-sample data set U for one category among multiple categories; generating a classifier from the positive-sample labeled data set P, the unknown-sample data set U and the corresponding user data features; and using the classifier to determine whether user data in the unlabeled data set belongs to that category.

In one embodiment, the user data may be e-commerce user data, and the multiple categories may be multiple life stages, such as mother-and-baby life stages.

In one embodiment, the method may further include determining whether the user data satisfies the labeling rules and, if so, adding it to the labeled data set. The labeling rules may include: if the user data indicates purchases of products from only one life stage, the purchase time is taken as the start time of that life stage; if the user data indicates purchases of products from multiple life stages made in chronological order, the time of the last purchase determines the start time of the corresponding life stage; and/or if the user data indicates purchases of products from multiple life stages that were not made in chronological order, the earliest life stage prevails, and the earliest order time belonging to that life stage is taken as its start time. The method may further include determining which life stage the user data currently belongs to, based on the determined start time of the life stage, the duration of each life stage and the current time.

In one embodiment, the features may include category features of purchased products, demographic features and time features. The time features may include a purchase-time-weighted feature and features related to each life stage.

In one embodiment, the positive-sample labeled data set P may include the user data in the labeled data set that belongs to the category, and the unknown-sample data set U may include at least part of the set formed by the user data in the labeled data set that does not belong to the category together with the user data in the unlabeled data set. Generating the classifier may include the following steps:

Set the classifier M to empty and the reliable negative sample set RN to empty;

Randomly sample a subset S of user data from P and add it to U, updating P and U as Ps = P - S and Us = U + S;

With Ps as positive samples and Us as negative samples, train a logistic regression classifier LR_i, i = 0, 1, ..., as follows:

(1) Use S to set the classifier threshold th;

(2) For each sample u ∈ Us: if the classification result of LR_i for u is below the threshold th, add u to RN, and set Us = Us - RN;

(3) M = M + LR_i;

With Ps as positive samples and RN as negative samples, train a logistic regression classifier LR_i and repeat steps (1)-(3) above until the iteration termination condition is met, obtaining the classifier LR_last;

Use LR_last to classify P. If more than a certain threshold number of positive samples are judged negative, return LR_1 as the final classifier; otherwise return LR_last as the final classifier.
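The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the tiny gradient-descent logistic regression `train_lr`, the stopping rule (stop when no sample falls below the threshold, or after `max_iter` rounds), the 0.5 decision cut-off used in the final check on P, and the parameter names `spy_rate`, `spy_recall` and `max_rejected` are all hypothetical choices made for the example.

```python
import math
import random

def _sigmoid(z):
    z = max(-30.0, min(30.0, z))              # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(pos, neg, epochs=200, step=0.5):
    """Tiny batch-gradient-descent logistic regression; returns a scoring function."""
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in data:
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - step * g / len(data) for wi, g in zip(w, gw)]
        b -= step * gb / len(data)
    return lambda x: _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def spy_pu_learning(P, U, spy_rate=0.15, spy_recall=0.95,
                    max_iter=5, max_rejected=0.1, seed=0):
    rng = random.Random(seed)
    P = list(P)
    rng.shuffle(P)
    n_spy = max(1, int(len(P) * spy_rate))
    S, Ps = P[:n_spy], P[n_spy:]              # spies S; Ps = P - S
    Us = list(U) + S                          # Us = U + S
    RN = []                                   # reliable negatives, initially empty
    M = [train_lr(Ps, Us)]                    # LR_0: Ps positive, Us negative
    for _ in range(max_iter):
        lr_i = M[-1]
        spy_scores = sorted(lr_i(s) for s in S)
        cut = min(int((1.0 - spy_recall) * len(S) + 1e-9), len(S) - 1)
        th = spy_scores[cut]                  # step (1): threshold set from S
        moved = [u for u in Us if lr_i(u) < th]
        if not moved:                         # assumed stopping rule
            break
        RN.extend(moved)                      # step (2): grow RN, shrink Us
        Us = [u for u in Us if lr_i(u) >= th]
        M.append(train_lr(Ps, RN))            # step (3), then retrain on Ps vs RN
    lr_last = M[-1]
    rejected = sum(1 for p in P if lr_last(p) < 0.5) / len(P)
    if rejected > max_rejected and len(M) > 1:
        return M[1]                           # fall back to LR_1
    return lr_last
```

The returned object is a scoring function: calling it on a feature vector gives a value between 0 and 1, with higher values meaning the user more likely belongs to the target category.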

According to a second aspect of the present disclosure, a user data classification apparatus is provided, including a feature generation unit 701, a labeling unit 702, a sample construction unit 703, a classifier generation unit 704 and a classification unit 705. The feature generation unit 701 is configured to generate features of user data. The labeling unit 702 is configured to generate a labeled data set and an unlabeled data set of the user data according to labeling rules. The sample construction unit 703 is configured to construct, from the labeled data set and the unlabeled data set, a positive-sample labeled data set P and an unknown-sample data set U for one category among multiple categories. The classifier generation unit 704 is configured to generate a classifier from the positive-sample labeled data set P, the unknown-sample data set U and the corresponding user data features. The classification unit 705 is configured to use the classifier to determine whether user data in the unlabeled data set belongs to that category.

In one embodiment, the user data may be e-commerce user data, and the multiple categories may be multiple life stages, such as mother-and-baby life stages.

In one embodiment, the labeling unit may be further configured to determine whether the user data satisfies the labeling rules and, if so, add it to the labeled data set. The labeling rules include: if the user data indicates purchases of products from only one life stage, the purchase time is taken as the start time of that life stage; if the user data indicates purchases of products from multiple life stages made in chronological order, the time of the last purchase determines the start time of the corresponding life stage; and/or if the user data indicates purchases of products from multiple life stages that were not made in chronological order, the earliest life stage prevails, and the earliest order time belonging to that life stage is taken as its start time. The labeling unit may be further configured to determine which life stage the user data currently belongs to, based on the determined start time of the life stage, the duration of each life stage and the current time.

In one embodiment, the features may include category features of purchased products, demographic features and time features, where the time features may also include a purchase-time-weighted feature and features related to each life stage.

In one embodiment, the positive-sample labeled data set P may include the user data in the labeled data set that belongs to the category, the unknown-sample data set U may include at least part of the set formed by the user data in the labeled data set that does not belong to the category together with the user data in the unlabeled data set, and the classifier generation unit may be further configured to:

Set the classifier M to empty and the reliable negative sample set RN to empty;

(1) Use S to set the classifier threshold th;

(3) M = M + LR_i;

According to a third aspect of the present disclosure, a server is provided, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in the first aspect.

According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The medium stores computer instructions which, when executed by a computer, cause the computer to perform the method described in the first aspect.

The present disclosure proposes an improved user data classification method that trains a classifier from a labeled data set and an unlabeled data set, thereby achieving more accurate classification. More specifically, life-stage targeting can be introduced into the audience-targeting business for e-commerce advertising, and it can be extended to targeting for every life stage, including mother-and-baby life stages, so that better personalized recommendations can be provided.

Brief description of the drawings

The above and other objects, features and advantages of the disclosure will become clearer from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:

Fig. 1 is a schematic overview showing an audience mining basic flow 100 according to an embodiment of the disclosure;

Figs. 2A to 2D are schematic diagrams showing how the life stage of user data is labeled according to an embodiment of the disclosure;

Fig. 3 is a schematic diagram showing a tree-structured e-commerce category system according to an embodiment of the disclosure;

Fig. 4 is a flowchart showing a method for generating the label of a specific life stage according to an embodiment of the disclosure;

Fig. 5 is a schematic diagram showing the design of an A/B test for label evaluation according to an embodiment of the disclosure;

Fig. 6 is a flowchart showing a user data classification method according to an embodiment of the disclosure;

Fig. 7 is a schematic block diagram showing a user data classification apparatus according to an embodiment of the disclosure;

Fig. 8 is a schematic block diagram showing an exemplary system architecture 800 to which the user data classification method or user data classification apparatus of the disclosure can be applied; and

Fig. 9 is a schematic structural diagram showing a computer system 900 for implementing embodiments of the disclosure.

Detailed description

Hereinafter, embodiments of the disclosure are described with reference to the drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the disclosure. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the disclosure.

The terms used here serve only to describe specific embodiments and are not intended to limit the disclosure. The words "a", "an" and "the" used here also cover the plural unless the context clearly indicates otherwise. Furthermore, the terms "include", "comprise" and the like indicate the presence of the stated features, steps, operations and/or components, but do not exclude the presence or addition of one or more other features, steps, operations or components.

All terms used here (including technical and scientific terms) have the meanings commonly understood by those skilled in the art unless otherwise defined. Terms used here should be interpreted with meanings consistent with the context of this specification, and not in an idealized or overly rigid manner.

The drawings show some block diagrams and/or flowcharts. It should be understood that some blocks in the block diagrams and/or flowcharts, or combinations of them, can be implemented by computer program instructions. These instructions can be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that, when executed by the processor, they create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts.

Below, the disclosure uses the mining of life stages (for example, mother-and-baby life stages) as an example to explain how user data is mined, or in other words classified, but those skilled in the art will understand that the disclosure can also be extended to other classifications.

Fig. 1 shows a schematic overview of the audience mining basic flow 100 according to an embodiment of the disclosure.

As shown in Fig. 1, the audience mining basic flow 100 according to this embodiment may include labeling data at 120. For example, user data can be obtained from the data warehouse 110, the purchasing behavior of e-commerce users analyzed, and reasonable rules defined so that labeling is automated and a labeled data set and an unlabeled data set are produced, as explained in detail below.

In addition, the audience mining basic flow 100 may include constructing features at 130. For example, user data can be obtained from the data warehouse 110, and features usable for training extracted from users' purchasing behavior.

The labeled and unlabeled data sets obtained in the data labeling operation 120 and the features obtained in the feature construction operation 130 can be fed into the classifier generation model 140 to produce a classifier. For example, the labeled data set, the unlabeled data set and the constructed training features can be used for positive and unlabeled learning, and this learning process generates a classifier model that labels large amounts of unlabeled data. Specifically, as shown in Fig. 1, logistic regression (LR) classifiers are produced iteratively using the positive and unlabeled learning algorithm. Although Fig. 1 shows two LR classifiers, those skilled in the art will understand that this is only an example indicating that classifiers are produced iteratively.

Furthermore, the resulting classifier can be evaluated at 150. For example, the classification results can be evaluated on a test set and with online A/B tests.

Each of the above processing steps is described in detail below. The disclosure uses mother-and-baby audience mining as an example to illustrate its specifics. For example, the data covers consumers' purchasing behavior on the Jingdong (JD.com) website from January to December 2015. For convenience, the mother-and-baby audience can be divided into the following stages, with the different labeled groups denoted Lx, as shown in the table below.

Table 1: Mother-and-baby stages and label values

Label value	Mother-and-baby life stage
L0	Pregnancy
L1	Baby 0-3 months
L2	Baby 3-6 months
L3	Baby 6-12 months
L4	Baby 12-24 months

Labeling data

First, statistical analysis can be performed on the user data to remove users whose order volume over a period is abnormal, for example users whose order frequency in the past year is extremely high or extremely low. Such users are considered either to be placing fake orders (order brushing) or to have no significant consumption pattern.
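A minimal sketch of this filtering step is shown below. The source gives no concrete cut-offs, so the bounds `min_orders` and `max_orders` and the function name are hypothetical values chosen only for illustration.

```python
def filter_abnormal_users(order_counts, min_orders=2, max_orders=200):
    """Keep users whose yearly order count lies within [min_orders, max_orders].

    order_counts: {user_id: number of orders in the past year}.
    Both bounds are assumed values, not taken from the source.
    """
    return {user: n for user, n in order_counts.items()
            if min_orders <= n <= max_orders}
```

In practice the bounds would more likely be derived from the observed distribution of order counts (for example, percentiles) than fixed by hand.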

Second, for the users who remain after filtering, it is first determined when each user entered a given mother-and-baby stage, and then, from the user's behavioral features, which mother-and-baby stage the user is probably in now (for example, by confirming temporal continuity or role).

The following describes how to determine when a user enters a given mother-and-baby stage.

Some products suit only a particular mother-and-baby stage. For example, users in L0 are more likely to buy radiation-protection clothing or folic acid. By dividing mother-and-baby products by stage, the stage a buyer is currently in can be roughly judged. The characteristic products and product attributes for each stage can be organized into a table like the one below.

Table 2: Characteristic products for each mother-and-baby stage

To track each user's sequence of purchases of mother-and-baby products over the whole year, a per-stage order behavior statistics table can be built for each user, recording the total number of orders the user placed for products belonging to each mother-and-baby stage, together with the first and last purchase times. These statistics allow a preliminary judgment of some users' mother-and-baby status and also serve as basic features for the prediction model described later.

Fig. 2A shows the behavior statistics for each mother-and-baby stage. They record information about each user's purchases of mother-and-baby products in each stage, for example the total number of orders for Lx (x = 0, 1, 2, 3, 4) products, the time of the first purchase of Lx products, and the time of the last purchase of Lx products.
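The per-stage statistics described above can be accumulated in a single pass over the order records. The sketch below assumes each order has already been mapped to a stage label and is given as a `(user_id, stage, order_date)` tuple; that input layout and the function name are assumptions made for the example.

```python
from datetime import date

def stage_stats(orders):
    """Build {user: {stage: [order_count, first_order_date, last_order_date]}}.

    orders: iterable of (user_id, stage, order_date) tuples (assumed layout).
    """
    stats = {}
    for user, stage, day in orders:
        rec = stats.setdefault(user, {}).setdefault(stage, [0, day, day])
        rec[0] += 1                    # total number of orders in this stage
        rec[1] = min(rec[1], day)      # first purchase time
        rec[2] = max(rec[2], day)      # last purchase time
    return stats
```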

After the per-stage behavior statistics have been generated, the following labeling rules determine which mother-and-baby stage a user with purchasing behavior is in and when the user first entered that stage:

Rule one: the user's orders do not span multiple mother-and-baby stages. In this case the user has only placed orders for products belonging to a single stage. For example, as shown in Fig. 2B, the user bought products belonging to stage L4 (once or several times) and no products from stages L0-L3. The user is then judged to be in stage L4 starting from the time of the earliest order. Because the user only placed orders belonging to stage L4, the output is user (L4, 2015-11-23), indicating that the user entered stage L4 on November 23, 2015.

Rule two: the user's orders span multiple mother-and-baby stages. This is further divided into two cases:

(a) The order times of the stages do not overlap. That is, the user bought products stage by stage in chronological order. The life stage is then that of the last product bought, counted from the earliest order time for that stage. For example, as shown in Fig. 2C, the products the user ordered last are in stage L4, so the output is user (L4, 2015-12-21), indicating that the user entered stage L4 on December 21, 2015.

(b) The order times of the stages overlap. That is, the user bought products from multiple life stages, and the corresponding stages do not progress in chronological order. For example, a user (who may actually be pregnant) first buys a stroller (assumed to be for 3-6 months) and later buys a bottle nipple (assumed to be for 0-3 months). The earliest life stage among the purchased products then prevails: the user is approximately judged to be in stage L1 (0-3 months), counted from the earliest order time in stage L1. For example, as shown in Fig. 2D, the products the user ordered last are in stage L1, so the output is user (L1, 2015-09-10), indicating that the user entered stage L1 on 2015-09-10.
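The two rules can be expressed as a small function over the per-stage first and last order dates. This is a sketch under one assumption that the text does not spell out: "chronological order" is taken to mean that, with stages sorted by index, the last order of each stage is no later than the first order of the next stage. The function name and input layout are hypothetical.

```python
from datetime import date

def label_stage(stats):
    """Return (stage_index, start_date) for one user, or None if no purchases.

    stats: {stage_index: (first_order_date, last_order_date)} for the stages
    the user bought from, with stage_index 0..4 standing for L0..L4.
    """
    if not stats:
        return None
    stages = sorted(stats)
    if len(stages) == 1:                              # rule one
        stage = stages[0]
    else:
        in_order = all(stats[a][1] <= stats[b][0]     # assumed test for "no overlap"
                       for a, b in zip(stages, stages[1:]))
        stage = stages[-1] if in_order else stages[0]  # rule two (a) / (b)
    return stage, stats[stage][0]                      # earliest order time of that stage
```

For the stroller-then-nipple example, the L2 purchase precedes the L1 purchase, so the stages are out of order and the function falls back to the earliest stage, L1.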

Once it has been determined when a user entered a given mother-and-baby stage, the pre-defined duration of each stage can be used to calculate which stage the user should be in at the current time (for example, December 31, 2015). This data forms the positive examples for training the classifier model (that is, user data labeled as belonging to a specific mother-and-baby stage), as described in detail later.
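A short sketch of this calculation follows. The stage durations in `STAGE_DAYS` are assumptions: the L1-L4 values approximate the month ranges in Table 1 and the 280-day pregnancy length is a conventional figure, none of which is stated as a day count in the source.

```python
from datetime import date

# Assumed durations in days for L0..L4 (pregnancy, 0-3, 3-6, 6-12, 12-24 months).
STAGE_DAYS = [280, 90, 90, 180, 365]

def current_stage(start_stage, start_date, today):
    """Advance from (start_stage, start_date) to the stage occupied on `today`.

    Returns the stage index, or None if `today` precedes the start date or the
    user has aged out of the last stage.
    """
    elapsed = (today - start_date).days
    if elapsed < 0:
        return None
    stage = start_stage
    while stage < len(STAGE_DAYS):
        if elapsed < STAGE_DAYS[stage]:
            return stage
        elapsed -= STAGE_DAYS[stage]   # move on to the next stage
        stage += 1
    return None
```

With these assumed durations, a user who entered L1 on 2015-09-10 has spent 112 days by 2015-12-31 and is therefore placed in L2.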

Note that although the labeling rules above can calculate which mother-and-baby stage a user should be in, they may cover only part of the user data. User data that the rules do not cover forms the unlabeled data, which must later be classified by the classifier.

Thus, data labeling 120 produces a labeled data set and an unlabeled data set from the user data in the data warehouse 110.

Constructing features

Before the model is trained, features must be constructed as input to the classifier generation model 140. The features used may include the following groups: category features, user demographic features and time features, each described below.

Category features

In general, e-commerce sites display products in hierarchical categories. Jingdong, for example, uses a three-level category hierarchy to display products with different attributes so that users can quickly find what they need. Fig. 3 shows a tree-structured e-commerce category system according to an embodiment of the disclosure. The products in the Jingdong store include: first-level categories such as household appliances, ..., books, audio-visual products and e-books; second-level categories under household appliances such as major appliances, ..., personal care and health, and hardware and home improvement; and third-level categories under major appliances such as flat-panel TVs, air conditioners and washing machines.

The products a user buys reflect the user's needs at that time or over the coming period. For example, a mother in early pregnancy is more likely to buy maternity wear and radiation-protection clothing, while later she may buy diapers, milk powder and a crib to prepare for her child's arrival. Which brand of diapers she buys (Pampers, Kao or another brand) is not necessarily of interest. Using the user's purchasing behavior at the third category level therefore describes the user's needs at a fine granularity while grouping products of the same type into one class.

To reduce the influence of popular categories, each user can be treated as a document and each category as a word occurring in that document, and the user's TF-IDF (term frequency-inverse document frequency) values computed to build the category feature vector.
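As an illustration of the user-as-document view, the sketch below computes TF-IDF weights with the plain textbook definition (term frequency times the logarithm of total users over users who bought the category). The source does not say which TF-IDF variant or smoothing it uses, so this form and the function name are assumptions.

```python
import math
from collections import Counter

def category_tfidf(purchases):
    """Compute per-user TF-IDF weights over third-level categories.

    purchases: {user_id: [category, ...]}, one entry per purchased item.
    Returns {user_id: {category: tfidf}}.
    """
    n_users = len(purchases)
    doc_freq = Counter()
    for cats in purchases.values():
        doc_freq.update(set(cats))         # number of users who bought each category
    features = {}
    for user, cats in purchases.items():
        tf = Counter(cats)
        features[user] = {
            c: (n / len(cats)) * math.log(n_users / doc_freq[c])
            for c, n in tf.items()
        }
    return features
```

A category bought by every user gets weight zero, which is exactly the damping of popular categories that the paragraph above describes.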

User demographic features

In general, a user's consumption behavior is related to the user's demographic attributes. For example, differences in age group, gender and membership level on the e-commerce site (which often reflects spending power) all show up in users' consumption habits. The disclosure uses the registered user information of the e-commerce website and users' shopping behavior to extract features along multiple user dimensions, collectively called the "user profile". The table below shows an example of user demographic features.

Table 3: User demographic features

Time features

Time features may include, for example, time features related to each life stage (for example, each mother-and-baby stage) and a time-weighted feature.

Time features related to each life stage. For example, whether a user bought maternity wear a year ago or a month ago makes a big difference when inferring which mother-and-baby stage the user is in now; the latter user is more likely to be in stage L0. Likewise, if a user has repeatedly bought products belonging to one mother-and-baby stage (L0), it is possible to estimate roughly how long the user has been in that stage: a user who has been through 9 months of pregnancy is more likely to buy next-stage (L1) products than a user who has been through 2 months. The disclosure therefore proposes the exemplary mother-and-baby product purchasing features shown in the table below.

Table 4: Time features of a user's purchases of products in each mother-and-baby stage

Time-weighted feature. Similarly, whether a user bought products a year ago or a month ago makes a big difference to the user's current activity level; the latter user is more likely to buy again in the near term. The time-weighted feature is defined by the following formula:

where λ is the decay factor, which the disclosure may set to 5.0/365; T is the timestamp of December 31, 2015; t_i is the date timestamp of the user's i-th order; and m is the user's total number of orders.
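The formula itself does not survive in this text. The sketch below therefore assumes one common form that is consistent with the symbols defined above: a sum over the m orders of exp(-λ (T - t_i)), with the time difference measured in days. Both the exponential form and the day unit are assumptions, not the formula of the disclosure.

```python
import math
from datetime import date

def time_weight(order_dates, ref=date(2015, 12, 31), decay=5.0 / 365):
    """Assumed time-weighted activity: sum of exp(-decay * days_before_ref).

    order_dates: the user's m order dates t_i; ref plays the role of T and
    decay the role of the decay factor lambda.
    """
    return sum(math.exp(-decay * (ref - t).days) for t in order_dates)
```

Under this assumed form, an order placed on the reference date contributes 1, while an order placed a full year earlier contributes about exp(-5), so recent buyers score much higher.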

Finally, each group of features is normalized for training and organized into a multidimensional feature matrix in which each user's feature vector corresponds to one row, as follows:
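One way to assemble such a matrix is sketched below. The source does not name the normalization method, so per-column min-max scaling is an assumption, as are the function name and the dictionary input layout.

```python
def build_feature_matrix(user_features, feature_names):
    """Return (users, matrix) with one min-max-normalized row per user.

    user_features: {user_id: {feature_name: value}}; missing features count as 0.
    Min-max scaling is an assumed choice of normalization.
    """
    users = sorted(user_features)
    raw = [[float(user_features[u].get(f, 0.0)) for f in feature_names]
           for u in users]
    cols = list(zip(*raw))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    matrix = [[(v - l) / (h - l) if h > l else 0.0
               for v, l, h in zip(row, lo, hi)]
              for row in raw]
    return users, matrix
```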

Generating the classifier model

In the disclosure, generating the classifier model may include positive and unlabeled learning. The disclosure applies a semi-supervised learning method (PU learning) to extend the audience. As described above, the labeling rules yield only a small set of labeled positive samples and cannot label a reliable set of negative samples, so a reliable classification model cannot be trained directly. The disclosure solves this classification problem, with only a few positive examples and many unknown samples, by applying positive and unlabeled learning. Specifically, the disclosure proposes an algorithm that may be called the "spy technique": positive examples are mixed into the unlabeled samples at a certain sampling rate and a model is trained to obtain reliable negative examples. The sampling rate is the proportion of all positive examples that is drawn and mixed into the unlabeled data.

The basic idea of the algorithm is as follows:

Because there are no reliable negative samples, the initial reliable negative sample set RN is empty. A subset S is randomly drawn from the positive samples P and added to the unknown samples U, giving Ps and Us. Ps is labeled 1 and Us is labeled 0, and an initial logistic regression classifier is trained. The set S is then used to set a threshold, all of U is classified, and the data W that the classifier labels 0 is added to RN. After that, a classifier is trained with Ps and RN and used to classify the remaining U; samples classified as 0 are added to RN, and the process iterates until the termination condition is met. In short, provided the classification accuracy on positive examples is maintained, each iteration of the positive and unlabeled learning algorithm can enlarge the reliable negative sample set.

Positive and unlabeled learning is generally applied to binary classification problems, whereas dividing users into mother-and-baby life stages is a multi-class problem. The disclosure uses one-vs-rest to convert the multi-class problem into multiple binary classification problems.
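The one-vs-rest decomposition can be sketched as follows: for each target stage, the labeled users in that stage form P, and everything else (labeled users in other stages plus all unlabeled users) forms the pool from which U is drawn. The function name and input layout are assumptions made for the example.

```python
def one_vs_rest_sets(labeled, unlabeled, stages=("L0", "L1", "L2", "L3", "L4")):
    """Build one (P, U) pair per stage for binary PU learning.

    labeled: {user_id: stage} from the labeling rules;
    unlabeled: iterable of user_ids the rules did not cover.
    """
    problems = {}
    for target in stages:
        P = [u for u, s in labeled.items() if s == target]
        U = [u for u, s in labeled.items() if s != target] + list(unlabeled)
        problems[target] = (P, U)
    return problems
```

Each resulting pair is an independent binary problem, so one classifier is trained per stage.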

The complete positive and unlabeled learning algorithm based on the "spy technique" can be summarized as follows:

Algorithm: Positive and unlabeled learning algorithm flow

The algorithm above has several parameters that must be set, such as the sampling rate s% and the threshold th. To avoid leaving too few positive samples for training while keeping S large enough to act as "spies", the disclosure uses a sampling rate of, for example, 15%. Ideally, the threshold th set for the model produced in each iteration would allow the whole of S to be correctly classified as positive. Because the data contains noise, however, th is set so that the model's classification accuracy on S lies between, for example, 80% and 100%; the disclosure sets th so that the accuracy on S is, for example, 95%.
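One way to turn a target accuracy on S into a concrete threshold is to take a low quantile of the spy scores, as sketched below. The quantile rule, the function name and the `recall` parameter are assumptions; the source only states the target accuracy range.

```python
def spy_threshold(spy_scores, recall=0.95):
    """Pick th so that about `recall` of the spies score at or above it.

    spy_scores: classifier scores of the spy samples S (must be non-empty).
    The small epsilon guards against floating-point rounding in the index.
    """
    ranked = sorted(spy_scores)
    cut = int((1.0 - recall) * len(ranked) + 1e-9)
    return ranked[min(cut, len(ranked) - 1)]
```

With `recall=1.0` the threshold is the lowest spy score, so every spy is kept as positive. Lowering `recall` toward 0.8 raises the threshold, which tolerates more noise among the spies and leaves more unlabeled samples below the threshold as reliable negatives.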

Fig. 4 shows a flowchart of a method 400 for generating pregnancy (L0) stage labels according to an embodiment of the present disclosure. Those skilled in the art will appreciate that the same method is also applicable to generating L1 to L4 stage labels.

Method 400 starts at step 401. Then, in step 402, it is determined whether the user data satisfies the automated labeling rules. If so, positive examples of the user data are obtained in step 403; that is, the labeled dataset is obtained. At this point, the labeled dataset may include user data of each of the stages L0 to L4. If the result of step 402 is no, the corresponding user data constitutes the unlabeled data, i.e., user data that the automated labeling rules cannot cover.

Next, in step 405, it is determined whether the positive-example user data obtained in step 403 belongs to the L0 stage, because binary classification will subsequently be carried out in a one-vs-rest manner. If so, the corresponding user data is labeled 1, and a training dataset P, a validation dataset, and a test set may be randomly generated in a ratio of 8:1:1. Those skilled in the art will appreciate that the positive-example data may be divided into a training dataset, a validation dataset, and a test set in a certain ratio in order to produce a more accurate and reliable classifier, and that this ratio is not limited to the above. The unlabeled data produced in step 404 and the user data determined in step 405 not to belong to the L0 stage may be merged, and data may be drawn from the merged set in a certain ratio to produce the unlabeled dataset U. For example, data may be randomly sampled from the merged data at a ratio of 1:10 relative to the training dataset P to produce the unlabeled dataset U. That is, the ratio of positive-example samples to unlabeled samples is 1:10.
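As a hedged illustration of the 8:1:1 split and the 1:10 sampling described above, the following Python sketch uses only the standard library. The function names, the fixed random seeds, and the integer user identifiers are assumptions made for the example.

```python
import random


def split_811(positives, seed=0):
    """Randomly split positives into training, validation and test sets (8:1:1)."""
    items = list(positives)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def sample_unlabeled(train_P, merged_pool, ratio=10, seed=0):
    """Draw U from the merged pool so that |P| : |U| is 1 : ratio (capped by pool size)."""
    pool = list(merged_pool)
    k = min(len(pool), ratio * len(train_P))
    return random.Random(seed).sample(pool, k)


# Hypothetical identifiers: 100 L0 positives and 1900 merged non-L0 or unlabeled users.
train_P, val_P, test_P = split_811(range(100))
U = sample_unlabeled(train_P, range(100, 2000))
```

With 80 training positives, the sampled set U holds 800 users, which gives the 1:10 ratio.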

In addition to the positive-example training dataset P and the unlabeled dataset U, the features of the user data are also needed as input to the positive and unlabeled learning algorithm. Therefore, in step 408, the features of the user data may be extracted. Based on the above, in step 409, a classifier is produced by the positive and unlabeled learning algorithm; for the detailed process, refer to the flow described above.

Then, in step 410, the unlabeled data is classified using the produced classifier. If the output in step 411 is 1, the corresponding user data is tagged with the L0 label in step 412. In step 413, method 400 ends.

By repeating a similar method, all user data can be classified into the respective life stages.

Effect assessment

In addition to offline cross-validation and evaluation of the classification model, the present disclosure also designs an online A/B test (ABTest) verification mechanism, which verifies the reliability of the mining results for predicted e-commerce consumer groups by checking label quality and business metrics online through ABTest.

Fig. 5 shows the ABTest label evaluation design. On the traffic side, traffic is divided into three sets according to whether an impression hits a labeled user. Set A: impressions that participate in the experiment; Set B: impressions in A for which the user carried the label L under test at request time; Set C: impressions in B that were triggered by the targeting label L. Here, taking verification of the maternal and infant crowd labels as an example, the following comparison experiments are designed to measure the value of the L0 targeting label:

Exp-base: PV (page view) sampling, baseline experiment, representing data that uses the label L0 correctly;

Exp-random: PV sampling, random experiment, representing data that uses randomly assigned targeting labels L0 (implemented by randomly selecting users u1 and u2 and swapping their L0 targeting label data, leaving everything else unchanged);

Exp-unuse: PV sampling, without using the targeting label L0; the L0 label data carried by the user is removed manually.

Through the ABTest system, Exp-base is compared with Exp-random and with Exp-unuse respectively in set B, and Exp-base is compared with Exp-random in set C, observing advertising business metrics such as CPM (cost per mille, payment per thousand impressions) and CTR (click-through rate).
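For illustration, the two metrics and the label swap used in Exp-random might be sketched in Python as follows. The formulas are the usual definitions (CTR as clicks divided by impressions, CPM as cost per thousand impressions); the function names and the toy label table are assumptions, and no measured results are implied.

```python
import random


def ctr(clicks, impressions):
    """Click-through rate: clicks per impression."""
    return clicks / impressions if impressions else 0.0


def cpm(cost, impressions):
    """Cost per mille: cost per thousand impressions."""
    return 1000 * cost / impressions if impressions else 0.0


def swap_two_labels(l0_by_user, seed=0):
    """Exp-random: pick two users at random and exchange their L0 label data."""
    rng = random.Random(seed)
    u1, u2 = rng.sample(sorted(l0_by_user), 2)
    swapped = dict(l0_by_user)            # everything else stays unchanged
    swapped[u1], swapped[u2] = l0_by_user[u2], l0_by_user[u1]
    return swapped


# Hypothetical label table: True means the user carries the L0 targeting label.
labels = {"a": True, "b": False, "c": True}
randomized = swap_two_labels(labels)
```

Comparing `ctr` and `cpm` between the experiment arms on the same impression set is what separates the effect of the label from the effect of the traffic.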

Fig. 6 shows a user data classification method 600 according to an embodiment of the present disclosure. The method 600 includes: in step 601, generating features of user data; in step 602, generating a labeled dataset and an unlabeled dataset of the user data according to labeling rules; in step 603, constructing, from the labeled dataset and the unlabeled dataset, a positive-sample labeled dataset P and an unknown-sample dataset U for one category among multiple (for example, more than two) categories; in step 604, producing a classifier from the positive-sample labeled dataset P, the unknown-sample dataset U, and the corresponding user data features; and in step 605, using the classifier to determine whether user data in the unlabeled dataset belongs to that category.

In one embodiment, the user data may be e-commerce user data, and the multiple categories are multiple life stages.

In one embodiment, the method 600 may further include determining whether the user data satisfies the labeling rules and, if so, adding it to the labeled dataset. The labeling rules may include: if the user data indicates that goods of only one life stage were purchased, the purchase time is taken as the start time of that life stage; if the user data indicates that goods of more than one life stage were purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage; and/or if the user data indicates that goods of more than one life stage were purchased but not in chronological order, the earliest life stage prevails, and the earliest order time belonging to that life stage is taken as the start time of that life stage. The method may further include determining which life stage the user data currently belongs to based on the determined start time of the life stage, the duration of each life stage, and the current time.
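One possible reading of these labeling rules is sketched below in Python. Several points are assumptions made for the example: purchases are given as (day number, stage index) pairs, the earliest purchase is used when only one stage was bought, "the last purchase" is read as fixing the start of the stage of the most recent purchase, and the stage durations in `DURATIONS` are placeholders because the actual durations are not stated in this section.

```python
from typing import List, Optional, Tuple

Purchase = Tuple[int, int]  # (day number, stage index 0..4 for L0..L4)

# Placeholder stage lengths in days (hypothetical values).
DURATIONS = [280, 365, 730, 1095, 1825]


def stage_start(purchases: List[Purchase]) -> Tuple[int, int]:
    """Return (stage, start_day) according to the three labeling rules."""
    ordered = sorted(purchases)                  # chronological order
    stages = [s for _, s in ordered]
    if len(set(stages)) == 1:                    # rule 1: a single stage
        return stages[0], ordered[0][0]
    if stages == sorted(stages):                 # rule 2: stages bought in order
        day, stage = ordered[-1]
        return stage, day
    first = min(stages)                          # rule 3: earliest stage prevails
    return first, min(d for d, s in ordered if s == first)


def current_stage(stage: int, start_day: int, today: int) -> Optional[int]:
    """Advance from the start stage by the stage durations until `today`."""
    elapsed = today - start_day
    if elapsed < 0:
        return None
    for s in range(stage, len(DURATIONS)):
        if elapsed < DURATIONS[s]:
            return s
        elapsed -= DURATIONS[s]
    return None                                  # past the last stage


stage, start = stage_start([(10, 2), (50, 0), (80, 0)])
now = current_stage(stage, start, today=400)
```

A user whose purchases satisfy none of the rules would simply stay in the unlabeled dataset; that case is outside this sketch.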

Set the classifier M to empty and the reliable negative sample set RN to empty;

(1) Use S to set the classifier threshold th;

(3) M = M + LR_i;

Fig. 7 shows a user data classification apparatus 700 according to an embodiment of the present disclosure. The user data classification apparatus 700 includes a feature generation unit 701, a labeling unit 702, a sample construction unit 703, a classifier generation unit 704, and a classification unit 705. The feature generation unit 701 is configured to generate features of user data. The labeling unit 702 is configured to generate a labeled dataset and an unlabeled dataset of the user data according to labeling rules. The sample construction unit 703 is configured to construct, from the labeled dataset and the unlabeled dataset, a positive-sample labeled dataset P and an unknown-sample dataset U for one category among multiple categories. The classifier generation unit 704 is configured to produce a classifier from the positive-sample labeled dataset P, the unknown-sample dataset U, and the features of the corresponding user data. The classification unit 705 is configured to use the classifier to determine whether user data in the unlabeled dataset belongs to that category.

In one embodiment, the user data may be e-commerce user data, and the multiple categories may be multiple life stages.

Set the classifier M to empty and the reliable negative sample set RN to empty;

(1) Use S to set the classifier threshold th;

(3) M = M + LR_i;

Fig. 8 shows an exemplary system architecture 800 to which the user data classification method or the user data classification apparatus of the present disclosure may be applied.

As shown in Fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 is the medium that provides communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired or wireless communication links or fiber-optic cables.

A user may use the terminal devices 801, 802, 803 to interact with the server 805 through the network 804 in order to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 801, 802, 803, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, and social platform software (examples only).

The terminal devices 801, 802, 803 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.

The server 805 may be a server that provides various services, for example a back-end management server (example only) that supports shopping websites browsed by users with the terminal devices 801, 802, 803. The back-end management server may analyze and otherwise process received data such as information query requests, and feed the processing results (for example, targeted push information or product information; examples only) back to the terminal devices.

It should be noted that the user data classification method provided by the embodiments of the present application is generally executed by the server 805; accordingly, the user data classification apparatus is generally provided in the server 805.

It should be understood that the numbers of terminal devices, networks, and servers in Fig. 8 are merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation.

Referring now to Fig. 9, a schematic structural diagram of a computer system 900 suitable for implementing embodiments of the present disclosure is shown. The computer system shown in Fig. 9 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.

As shown in Fig. 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the system 900. The CPU 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 909 and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above-described functions defined in the system of the present disclosure are performed.

It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. Also in the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Claims

1. A user data classification method, comprising:

generating features of user data;

generating a labeled dataset and an unlabeled dataset of the user data according to labeling rules;

constructing, from the labeled dataset and the unlabeled dataset, a positive-sample labeled dataset P and an unknown-sample dataset U for one category among multiple categories;

producing a classifier from the positive-sample labeled dataset P, the unknown-sample dataset U, and the corresponding user data features; and

determining, using the classifier, whether user data in the unlabeled dataset belongs to the one category.

2. The method according to claim 1, wherein the user data is e-commerce user data and the multiple categories are multiple life stages.

3. The method according to claim 2, further comprising determining whether the user data satisfies the labeling rules and, if so, adding it to the labeled dataset, the labeling rules including:

if the user data indicates that goods of only one life stage were purchased, taking the purchase time as the start time of that life stage;

if the user data indicates that goods of more than one life stage were purchased in chronological order, determining the start time of the corresponding life stage from the time of the last purchase; and/or

if the user data indicates that goods of more than one life stage were purchased but not in chronological order, letting the earliest life stage prevail and taking the earliest order time belonging to that life stage as the start time of that life stage;

wherein the method further comprises determining which life stage the user data currently belongs to based on the determined start time of the life stage, the duration of each life stage, and the current time.

4. The method according to claim 2, wherein the features include category features of purchased goods, demographic attribute features, and temporal features, wherein the temporal features include purchase-time-weighted features and features related to each life stage.

5. The method according to claim 1, wherein the positive-sample labeled dataset P includes user data in the labeled dataset that belongs to the category, the unknown-sample dataset U includes at least a portion of the set consisting of user data in the labeled dataset that does not belong to the category and user data in the unlabeled dataset, and producing the classifier comprises the following steps:

setting the classifier M to empty and the reliable negative sample set RN to empty;

taking Ps as positive samples and Us as negative samples, training a logistic regression classifier LR_i, i = 0, 1, ..., as follows:

(1) using S to set the classifier threshold th;

(3) M = M + LR_i;

taking Ps as positive samples and RN as negative samples, training a logistic regression classifier LR_i and repeating steps (1)-(3) above until an iteration termination condition is met, to obtain a classifier LR_last;

using LR_last to classify P, and if more than a certain threshold number of positive samples are judged negative, returning LR_1 as the final classifier, and otherwise returning LR_last as the final classifier.

6. A user data classification apparatus, comprising:

a feature generation unit configured to generate features of user data;

a labeling unit configured to generate a labeled dataset and an unlabeled dataset of the user data according to labeling rules;

a sample construction unit configured to construct, from the labeled dataset and the unlabeled dataset, a positive-sample labeled dataset P and an unknown-sample dataset U for one category among multiple categories;

a classifier generation unit configured to produce a classifier from the positive-sample labeled dataset P, the unknown-sample dataset U, and the features of the corresponding user data; and

a classification unit configured to use the classifier to determine whether user data in the unlabeled dataset belongs to the one category.

7. The apparatus according to claim 6, wherein the user data is e-commerce user data and the multiple categories are multiple life stages.

8. The apparatus according to claim 7, wherein the labeling unit is further configured to determine whether the user data satisfies the labeling rules and, if so, to add it to the labeled dataset, the labeling rules including:

wherein the labeling unit is further configured to determine which life stage the user data currently belongs to based on the determined start time of the life stage, the duration of each life stage, and the current time.

9. The apparatus according to claim 7, wherein the features include category features of purchased goods, demographic attribute features, and temporal features, wherein the temporal features include purchase-time-weighted features and features related to each life stage.

10. The apparatus according to claim 6, wherein the positive-sample labeled dataset P includes user data in the labeled dataset that belongs to the category, the unknown-sample dataset U includes at least a portion of the set consisting of user data in the labeled dataset that does not belong to the category and user data in the unlabeled dataset, and the classifier generation unit is further configured to:

set the classifier M to empty and the reliable negative sample set RN to empty;

(1) use S to set the classifier threshold th;

(3) M = M + LR_i;

11. A server, comprising:

one or more processors; and

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1 to 5.

12. A computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 5.