CN107273454A - User data classification method, apparatus, server and computer-readable storage medium - Google Patents
- Publication number: CN107273454A
- Application number: CN201710401985.2A
- Authority: CN (China)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Abstract
The present disclosure provides a user data classification method, including: generating features of user data; generating a labeled data set and an unlabeled data set of the user data according to labeling rules; constructing, from the labeled data set and the unlabeled data set, a positive-sample labeled data set P and an unknown-sample data set U for one category among multiple categories; generating a classifier according to the positive-sample labeled data set P, the unknown-sample data set U and the corresponding user data features; and determining, using the classifier, whether user data in the unlabeled data set belongs to that category. The disclosure classifies user data with an improved positive and unlabeled (PU) learning algorithm. It is suitable for audience feature extraction and for mining audiences at similar life stages in a system, so as to provide e-commerce advertising with precise audience targeting.
Description
Technical field
The present disclosure relates to the field of Internet technology, and in particular to a user data classification method, apparatus, server and computer-readable storage medium.
Background
In recent years, market researchers and sociologists have increasingly recognized that different categories of consumers, for example consumers at different life stages, show different shopping behavior. Consumers can be divided into coarse-grained life stages, for example: student (young and unmarried), newlywed (young and without children), middle-aged (married, with zero or more children), and elderly (older or retired, with children living independently). Clearly, people at different life stages (age brackets) show differentiated consumption tendencies. For example, pregnant women buy folic acid and vitamins, while mothers buy goods that match the age of their baby, such as milk powder, strollers, safety seats and educational toys. In the mother-and-baby channels of e-commerce websites and in vertical apps, consumer purchasing patterns are quite pronounced. Introducing life-stage targeting into the precise audience targeting business and the recommendation systems of e-commerce advertising can therefore yield better recommendation results.
However, in the course of realizing the present invention, the inventors found that the prior art has at least the following technical problems. The effectiveness of a method depends heavily on the correctness and scale of the training data. In addition, some goods, such as mother-and-baby goods, have standardized attributes (milk powder, for example, clearly states the suitable age range) and therefore already carry strong audience targeting, so using them directly as a recommendation application may not be suitable. A method and apparatus for classifying users is therefore needed that classifies users better, for example by mining consumers at the same life stage in an e-commerce system more accurately and reliably, so as to serve precise audience targeting in e-commerce advertising.
Summary of the invention
According to a first aspect of the present disclosure, a user data classification method is provided. The method includes: generating features of user data; generating a labeled data set and an unlabeled data set of the user data according to labeling rules; constructing, from the labeled data set and the unlabeled data set, a positive-sample labeled data set P and an unknown-sample data set U for one category among multiple categories; generating a classifier according to the positive-sample labeled data set P, the unknown-sample data set U and the corresponding user data features; and determining, using the classifier, whether user data in the unlabeled data set belongs to that category.
In one embodiment, the user data may be e-commerce user data, and the multiple categories may be multiple life stages, for example mother-and-baby life stages.
In one embodiment, the method may further include judging whether the user data satisfies the labeling rules and, if so, adding it to the labeled data set. The labeling rules may include: if the user data indicates that goods of only one life stage were purchased, the purchase time is taken as the start time of that life stage; if the user data indicates that goods of multiple life stages were purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage; and/or if the user data indicates that goods of multiple life stages were purchased but not in chronological order, the earliest life stage prevails, and the earliest order time belonging to that life stage is taken as the start time of that life stage. The method may further include determining which life stage the user data currently belongs to according to the determined start time of the life stage, the duration of each life stage and the current time.
In one embodiment, the features may include category features of the purchased goods, demographic attribute features and temporal features. The temporal features may include purchase-time weighted features and features related to each life stage.
In one embodiment, the positive-sample labeled data set P may include the user data in the labeled data set that belongs to the category, the unknown-sample data set U may include at least a part of the set formed by the user data in the labeled data set that does not belong to the category and the user data in the unlabeled data set, and generating the classifier may include the following steps:
Set the classifier set M to empty and the reliable negative sample set RN to empty;
Randomly sample a part S of the user data from P and add it to U, updating P and U, denoted Ps = P - S, Us = U + S;
With Ps as positive samples and Us as negative samples, train a logistic regression classifier LR_i, i = 0, 1, ..., as follows:
(1) Use S to set the classifier threshold th;
(2) For each sample u ∈ Us: if the result of classifier LR_i is less than the threshold th, add u to RN, and set Us = Us - RN;
(3) M = M + LR_i;
With Ps as positive samples and RN as negative samples, train a logistic regression classifier LR_i and repeat steps (1)-(3) above until the iteration termination condition is met, obtaining the classifier LR_last;
Use LR_last to classify P. If more than a certain threshold number of positive samples are judged negative, return LR_1 as the final classifier; otherwise return LR_last as the final classifier.
According to a second aspect of the present disclosure, a user data classification apparatus is provided, including a feature generation unit 701, a labeling unit 702, a sample construction unit 703, a classifier generation unit 704 and a classification unit 705. The feature generation unit 701 is configured to generate features of user data. The labeling unit 702 is configured to generate a labeled data set and an unlabeled data set of the user data according to labeling rules. The sample construction unit 703 is configured to construct, from the labeled data set and the unlabeled data set, a positive-sample labeled data set P and an unknown-sample data set U for one category among multiple categories. The classifier generation unit 704 is configured to generate a classifier according to the positive-sample labeled data set P, the unknown-sample data set U and the corresponding user data features. The classification unit 705 is configured to determine, using the classifier, whether user data in the unlabeled data set belongs to that category.
In one embodiment, the user data may be e-commerce user data, and the multiple categories may be multiple life stages, for example mother-and-baby life stages.
In one embodiment, the labeling unit may be further configured to judge whether the user data satisfies the labeling rules and, if so, add it to the labeled data set. The labeling rules include: if the user data indicates that goods of only one life stage were purchased, the purchase time is taken as the start time of that life stage; if the user data indicates that goods of multiple life stages were purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage; and/or if the user data indicates that goods of multiple life stages were purchased but not in chronological order, the earliest life stage prevails, and the earliest order time belonging to that life stage is taken as the start time of that life stage. The labeling unit may be further configured to determine which life stage the user data currently belongs to according to the determined start time of the life stage, the duration of each life stage and the current time.
In one embodiment, the features may include category features of the purchased goods, demographic attribute features and temporal features, where the temporal features may include purchase-time weighted features and features related to each life stage.
In one embodiment, the positive-sample labeled data set P may include the user data in the labeled data set that belongs to the category, the unknown-sample data set U may include at least a part of the set formed by the user data in the labeled data set that does not belong to the category and the user data in the unlabeled data set, and the classifier generation unit may be further configured to:
Set the classifier set M to empty and the reliable negative sample set RN to empty;
Randomly sample a part S of the user data from P and add it to U, updating P and U, denoted Ps = P - S, Us = U + S;
With Ps as positive samples and Us as negative samples, train a logistic regression classifier LR_i, i = 0, 1, ..., as follows:
(1) Use S to set the classifier threshold th;
(2) For each sample u ∈ Us: if the result of classifier LR_i is less than the threshold th, add u to RN, and set Us = Us - RN;
(3) M = M + LR_i;
With Ps as positive samples and RN as negative samples, train a logistic regression classifier LR_i and repeat steps (1)-(3) above until the iteration termination condition is met, obtaining the classifier LR_last;
Use LR_last to classify P. If more than a certain threshold number of positive samples are judged negative, return LR_1 as the final classifier; otherwise return LR_last as the final classifier.
According to a third aspect of the present disclosure, a server is provided, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores computer instructions which, when executed by a computer, cause the computer to perform the method of the first aspect.
The present disclosure proposes an improved user data classification method that trains a classifier using a labeled data set and an unlabeled data set and thereby achieves more accurate classification. More specifically, life-stage targeting can be introduced into the precise audience targeting business of e-commerce advertising and can be extended to targeting for every life stage, including the mother-and-baby life stages, so that better personalized recommendation results can be provided.
Brief description of the drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
Fig. 1 is an overview diagram showing the basic audience mining flow 100 according to an embodiment of the present disclosure;
Figs. 2A to 2D are schematic diagrams showing how the life stage of user data is labeled according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram showing a tree-structured e-commerce category system according to an embodiment of the present disclosure;
Fig. 4 is a flowchart showing a method of generating the label of a specific life stage according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram showing the design of an A/B test label evaluation according to an embodiment of the present disclosure;
Fig. 6 is a flowchart showing the user data classification method according to an embodiment of the present disclosure;
Fig. 7 is a schematic block diagram showing the user data classification apparatus according to an embodiment of the present disclosure;
Fig. 8 is a schematic block diagram showing an exemplary system architecture 800 to which the user data classification method or user data classification apparatus of the present disclosure can be applied; and
Fig. 9 is a schematic structural diagram showing a computer system 900 for implementing embodiments of the present disclosure.
Detailed description
Hereinafter, embodiments of the present disclosure are described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the disclosure. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the disclosure.
The terms used herein are for describing specific embodiments only and are not intended to limit the disclosure. The words "a", "an" and "the" as used herein should also cover the meanings "a plurality of" and "multiple kinds of", unless the context clearly indicates otherwise. In addition, the terms "include", "comprise" and the like as used herein indicate the presence of the stated features, steps, operations and/or components, but do not exclude the presence or addition of one or more other features, steps, operations or components.
All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted as having meanings consistent with the context of this specification and should not be interpreted in an idealized or overly rigid manner.
Some block diagrams and/or flowcharts are shown in the drawings. It should be understood that some blocks in the block diagrams and/or flowcharts, or combinations thereof, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that the instructions, when executed by the processor, create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts.
Hereinafter, the disclosure uses the mining of life stages (for example, mother-and-baby life stages) as an example to illustrate how user data is mined, that is, classified. Those skilled in the art will appreciate, however, that the disclosure can also be extended to other classifications.
Fig. 1 shows an overview of the basic audience mining flow 100 according to an embodiment of the present disclosure.
As shown in Fig. 1, the basic audience mining flow 100 according to this embodiment may include labeling data at 120. For example, user data can be obtained from the data warehouse 110, the purchasing behavior of e-commerce users analyzed, and reasonable rules defined to label the data automatically and generate a labeled data set and an unlabeled data set, as explained in detail below.
In addition, the basic audience mining flow 100 may include constructing features at 130. For example, user data can be obtained from the data warehouse 110, and features available for training extracted from the users' purchasing behavior.
The labeled data set and unlabeled data set obtained in the data labeling operation 120 and the features obtained in the feature construction operation 130 can be fed into the classifier generation model 140 to generate a classifier. For example, positive and unlabeled learning is performed using the labeled data set, the unlabeled data set and the constructed training features, and this learning process generates a classifier model that is then used to label the large amount of unlabeled data. Specifically, as shown in Fig. 1, logistic regression (LR) classifiers are generated iteratively using the positive and unlabeled learning algorithm. Although Fig. 1 shows two LR classifiers, those skilled in the art will understand that this is only an example indicating that classifiers are generated iteratively.
In addition, the generated classifier can be evaluated at 150. For example, the classification results of the classifier can be evaluated using a test set and online A/B tests.
Each of the above processing steps is described in detail below. The disclosure uses mother-and-baby audience mining as an example to illustrate its content. For example, the data cover consumer behavior on the JD.com (Jingdong) website from January to December 2015. For convenience of description, the mother-and-baby audience can be divided into the following stages, with the different labeled audiences denoted by Lx, as shown in the table below.
Table 1: Mother-and-baby audience stages and label values
Label value | Mother-and-baby life stage |
L0 | Pregnancy |
L1 | Baby 0-3 months |
L2 | Baby 3-6 months |
L3 | Baby 6-12 months |
L4 | Baby 12-24 months |
Data labeling
First, statistical analysis can be performed on the user data to remove users whose order volume over a period of time is abnormal, for example users whose order frequency in the most recent year was extremely high or extremely low. Such users are considered either to be placing fake orders (order brushing) or to have no significant consumption characteristics.
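The filtering step above can be sketched as follows. This is only a minimal illustration: the function name `filter_abnormal_users` and the bounds `low` and `high` are hypothetical, since the source does not state concrete cut-offs.

```python
from collections import Counter

def filter_abnormal_users(orders, low=2, high=200):
    """Keep users whose order count lies in [low, high].

    `orders` is an iterable of (user_id, order_date) pairs. The bounds are
    illustrative assumptions, not values taken from the source.
    """
    counts = Counter(user_id for user_id, _ in orders)
    return {user for user, n in counts.items() if low <= n <= high}

# "b" orders too rarely and "c" far too often; only "a" is kept.
orders = [("a", "2015-01-02")] * 3 + [("b", "2015-03-01")] + [("c", "2015-05-05")] * 500
kept = filter_abnormal_users(orders)
```

In practice the bounds would more likely be derived from the observed distribution of order counts (for example percentiles) than fixed by hand.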
Second, for the users that remain after filtering, it is first determined when the user entered a given mother-and-baby stage, and then, based on the user's behavioral features, which mother-and-baby stage the user is likely to be in at present (for example, by confirming temporal continuity or role).
The following describes how to determine when a user entered a given mother-and-baby stage.
Some goods are suitable only for a particular mother-and-baby stage. For example, a user in stage L0 is more likely to buy radiation-protection clothing or folic acid. By dividing mother-and-baby goods by stage, the stage that the buyers are currently in can be roughly determined. The feature goods and item attributes corresponding to the different stages can be organized into a table like the following.
Table 2: Feature goods corresponding to the different mother-and-baby stages
To track each user's sequence of mother-and-baby purchases over the whole year, a per-stage order behavior statistics table can be built for each user. It records the total number of orders the user placed for goods of each mother-and-baby stage, together with the times of the first and the last purchase. These statistics allow a preliminary judgment of the mother-and-baby status of some users and can also serve as basic features for the prediction model described below.
Fig. 2A shows the behavior statistics for each mother-and-baby stage. They record information about each user's purchases of mother-and-baby goods at each stage, for example the total number of orders for goods belonging to Lx (x = 0, 1, 2, 3, 4), the time of the first purchase of Lx goods, and the time of the last purchase of Lx goods.
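The per-stage statistics described above can be sketched as a simple aggregation. This is a minimal sketch: the row layout `(user, stage, order_date)` and the function name `stage_stats` are assumptions made for illustration, and the mapping from an item to its stage (Table 2) is taken as already applied.

```python
from datetime import date

def stage_stats(orders):
    """Aggregate (user, stage, order_date) rows into per-user, per-stage statistics.

    Returns {user: {stage: {"orders": n, "first": date, "last": date}}}.
    """
    stats = {}
    for user, stage, day in orders:
        entry = stats.setdefault(user, {}).setdefault(
            stage, {"orders": 0, "first": day, "last": day})
        entry["orders"] += 1
        entry["first"] = min(entry["first"], day)
        entry["last"] = max(entry["last"], day)
    return stats

orders = [
    ("u1", "L4", date(2015, 11, 23)),
    ("u1", "L4", date(2015, 12, 5)),
    ("u2", "L1", date(2015, 9, 10)),
]
stats = stage_stats(orders)
```

Here `stats["u1"]["L4"]` holds two orders with a first purchase on 2015-11-23 and a last purchase on 2015-12-05.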
Once the per-stage behavior statistics of a user have been generated, the following labeling rules can be used to determine which mother-and-baby stage the user was in when the purchases were made and when the user first entered that stage:
Rule one: the user's orders do not span multiple mother-and-baby stages. In this case the user has only placed orders for goods belonging to a single stage. For example, as shown in Fig. 2B, the user bought goods belonging to stage L4 (once or several times) and bought no goods of stages L0-L3. In this case the user is judged to be in stage L4 starting from the time of the earliest order. Because the user only placed orders belonging to stage L4, the output for the user is (L4, 2015-11-23), indicating that the user entered stage L4 on 23 November 2015.
Rule two: the user's orders span multiple mother-and-baby stages. This is further subdivided into two cases:
(a) The order times of the stages do not overlap. That is, the user bought mother-and-baby goods stage by stage in chronological order. The corresponding life stage is then determined by the goods bought last, and the stage is counted from the earliest order time for that life stage. For example, as shown in Fig. 2C, the goods the user ordered last belong to stage L4, so the output for the user is (L4, 2015-12-21), indicating that the user entered stage L4 on 21 December 2015.
(b) The order times of the stages overlap. That is, the user bought goods of multiple life stages, and the corresponding life stages do not progress in chronological order. For example, a user (who may actually be pregnant) first bought a stroller (assumed to be used at 3-6 months) and later bought a bottle nipple (assumed to be used at 0-3 months). The earliest life stage among the purchased goods then prevails: the user is approximately judged to be in stage L1 (0-3 months), counted from the earliest order time of stage L1. For example, as shown in Fig. 2D, the goods the user ordered last belong to stage L1, so the output for the user is (L1, 2015-09-10), indicating that the user entered stage L1 on 2015-09-10.
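Rules one and two can be sketched as a single function over the per-stage statistics. This is a hedged sketch, not the patented implementation: the name `label_entry`, the input layout `{stage: (first_order, last_order)}` and the non-overlap test used to separate case (a) from case (b) are assumptions, and the dates other than those quoted in the text are made up for the example.

```python
from datetime import date

STAGES = ["L0", "L1", "L2", "L3", "L4"]

def label_entry(stats):
    """Return (stage, start_date) from {stage: (first_order, last_order)}."""
    bought = sorted(stats, key=STAGES.index)
    if len(bought) == 1:
        stage = bought[0]                      # rule one: a single stage
    else:
        # Purchases are "in chronological order" when each stage's orders end
        # before the next stage's orders begin (assumed test).
        in_order = all(stats[a][1] <= stats[b][0]
                       for a, b in zip(bought, bought[1:]))
        # rule two (a): take the stage bought last; (b): take the earliest stage
        stage = bought[-1] if in_order else bought[0]
    return stage, stats[stage][0]

rule_one = label_entry({"L4": (date(2015, 11, 23), date(2015, 12, 5))})
rule_two_a = label_entry({"L3": (date(2015, 6, 1), date(2015, 10, 2)),
                          "L4": (date(2015, 12, 21), date(2015, 12, 28))})
rule_two_b = label_entry({"L2": (date(2015, 8, 1), date(2015, 8, 1)),
                          "L1": (date(2015, 9, 10), date(2015, 9, 20))})
```

The three calls reproduce the outputs quoted for Figs. 2B, 2C and 2D: (L4, 2015-11-23), (L4, 2015-12-21) and (L1, 2015-09-10).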
Once it has been determined when the user entered a given mother-and-baby stage, the duration of each of the predefined stages can be used to calculate which mother-and-baby stage the user should be in at present (for example, on 31 December 2015). This part of the data serves as the positive samples for training the classifier model (that is, user data labeled as belonging to a specific mother-and-baby stage), as described in detail later.
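The calculation of the current stage can be sketched as below. The day counts in `DURATION` are assumptions derived loosely from Table 1 (about nine months of pregnancy, then 3, 3, 6 and 12 months); the source does not give exact durations, and the function name `current_stage` is hypothetical.

```python
from datetime import date

# Assumed stage lengths in days; not specified in the source.
DURATION = {"L0": 280, "L1": 90, "L2": 90, "L3": 180, "L4": 365}
ORDER = ["L0", "L1", "L2", "L3", "L4"]

def current_stage(stage, start, today):
    """Advance from `stage` (entered on `start`) to the stage reached on `today`.

    Returns None once the user has moved past the last stage.
    """
    elapsed = (today - start).days
    for s in ORDER[ORDER.index(stage):]:
        if elapsed < DURATION[s]:
            return s
        elapsed -= DURATION[s]
    return None

now = date(2015, 12, 31)
stage_2d = current_stage("L1", date(2015, 9, 10), now)   # user from Fig. 2D
stage_2b = current_stage("L4", date(2015, 11, 23), now)  # user from Fig. 2B
```

Under these assumed durations, the Fig. 2D user, who entered L1 on 2015-09-10, has moved on to L2 by 31 December 2015, while the Fig. 2B user is still in L4.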
It should be noted that, although the labeling rules above can be used to calculate which mother-and-baby stage a user should be in, the rules may cover only part of the user data and so produce only part of it as labeled data. That is, the rules may not cover all user data, and the user data they do not cover forms the unlabeled data, which later has to be classified by the classifier.
Thus, through data labeling 120, a labeled data set and an unlabeled data set are generated from the user data of the data warehouse 110.
Feature construction
Before the model is trained, features must be constructed as input to the classifier generation model 140. The features used may include the following groups: category features, user demographic attribute features and temporal features, each described below.
Category features
In general, e-commerce sites display goods in a hierarchy of categories. For example, JD.com uses three levels of categories to display goods with different attributes, so that users can quickly navigate to the goods they need. Fig. 3 shows a tree-structured e-commerce category system according to an embodiment of the present disclosure. In it, the goods of the JD.com store include: first-level categories such as household appliances, ..., and books, audio/video and e-books; second-level categories under household appliances such as large appliances, ..., personal care and health, and hardware and home improvement; and third-level categories under large appliances such as flat-panel TVs, air conditioners and washing machines.
The goods a user buys reflect the user's needs at that time or over the following period. For example, a mother in early pregnancy is more likely to buy maternity wear, radiation-protection clothing and the like, while later she may buy diapers, milk powder, a crib and so on to prepare in advance for the arrival of her child. Which brand of diapers she buys (for example Pampers, Kao or another brand) is not necessarily of interest. Selecting the user's purchasing behavior at the level of third-level categories therefore describes the user's needs at a suitably fine granularity while grouping goods of the same type into one class.
To reduce the influence of popular categories, each user can be treated as a document and each category as a word occurring in the document, and the TF-IDF (term frequency-inverse document frequency) value is computed per user to build the category feature vector.
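The user-as-document view can be sketched as follows. This is a minimal sketch using the plain unsmoothed variant tf × log(N / df); the source does not say which TF-IDF variant is used, so that choice, the function name `tfidf` and the sample categories are all assumptions.

```python
import math
from collections import Counter

def tfidf(purchases):
    """purchases: {user: [category, ...]} -> {user: {category: weight}}."""
    n_users = len(purchases)
    # Document frequency: number of users who bought each category at least once.
    df = Counter(c for cats in purchases.values() for c in set(cats))
    vectors = {}
    for user, cats in purchases.items():
        tf = Counter(cats)
        total = len(cats)
        vectors[user] = {c: (n / total) * math.log(n_users / df[c])
                         for c, n in tf.items()}
    return vectors

purchases = {
    "u1": ["diapers", "milk powder", "phone cases"],
    "u2": ["phone cases"],
    "u3": ["phone cases", "maternity wear"],
}
vectors = tfidf(purchases)
```

A category that every user buys ("phone cases" here) receives weight 0, which is the intended damping of popular categories, while "diapers" keeps a positive weight for `u1`.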
User demographic attribute features
In general, a user's consumption behavior is related to the user's demographic attributes. For example, differences in age group, gender, membership level on the e-commerce site (which often reflects purchasing power) and the like all show up as differences in consumption habits. The disclosure uses the registered user information of the e-commerce website and the users' shopping behavior to extract features along multiple user dimensions, collectively called the "user profile". The table below shows an example of user demographic attribute features.
Table 3: User demographic attribute features
Temporal features
The temporal features may include, for example, temporal features related to each life stage (for example, each mother-and-baby stage) and time-weighted features.
Temporal features related to each life stage. For example, whether a user bought maternity wear a year ago or a month ago makes a large difference when inferring which mother-and-baby stage the user is in now; the latter is more likely to belong to stage L0. Likewise, if a user repeatedly bought goods belonging to a certain mother-and-baby stage (L0), it is possible to infer roughly how long the user has spent in that stage: a user who has gone through 9 months of pregnancy is more likely to buy goods of the next stage (L1) than a user who has gone through 2 months of pregnancy. The disclosure therefore proposes the exemplary mother-and-baby purchase features shown in the table below.
Table 4: Temporal features of a user's purchases of goods of each mother-and-baby stage
Time-weighted features. Similarly, whether a user bought goods a year ago or a month ago makes a large difference to the user's current level of activity; the latter is more likely to buy again in the short term. The time-weighted feature is defined by the following formula:
where λ is the decay factor, which the disclosure may set to 5.0/365, T is the timestamp of 31 December 2015, t_i is the date timestamp of the user's i-th order, and m is the user's total number of orders.
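The formula itself is not reproduced in the text above. One form consistent with the stated symbols is a sum of exponentially decayed contributions, w = Σ_{i=1..m} exp(-λ (T - t_i)), with the time difference measured in days. The sketch below implements that assumed form; it should not be read as the formula of the source, and the function name `time_weight` is hypothetical.

```python
import math
from datetime import date

LAMBDA = 5.0 / 365            # decay factor given in the text
T = date(2015, 12, 31)        # reference date given in the text

def time_weight(order_dates, lam=LAMBDA, ref=T):
    """Sum of exponentially decayed order weights (assumed form, unit: days)."""
    return sum(math.exp(-lam * (ref - t).days) for t in order_dates)

recent = time_weight([date(2015, 11, 30)])   # one order a month before T
old = time_weight([date(2014, 12, 31)])      # one order a year before T
```

Under this assumed form an order placed a month before T contributes far more than one placed a year before, which matches the intuition stated in the text.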
Finally, each type of feature is normalized during training and organized into a multi-dimensional feature matrix, in which the feature vector of each user corresponds to one row of the matrix, as follows:
Sorter model is produced
In the disclosure, classifier model generation can include learning from positive and unlabeled examples. The disclosure applies a semi-supervised learning method (PU-Learning) to achieve audience expansion. As described above, the labeling rules yield only a small set of labeled positive samples and cannot label a reliable set of negative samples, so a reliable classification model cannot be trained directly. The disclosure applies a positive and unlabeled learning method to solve the classification problem in which only a small number of positive examples and a large number of unknown samples are available. Specifically, the disclosure proposes an algorithm that may be called the "spy technique": positive examples are mixed into the unlabeled samples at a certain sampling rate and a model is trained to obtain reliable negative examples. The sampling rate is the proportion of the total positive examples that is drawn and mixed into the unlabeled data.
The basic idea of the algorithm is as follows:
Since there are no reliable negative samples at the start, the initial reliable negative sample set RN is empty. A portion S of the data is randomly drawn from the positive samples P and added to the unknown samples U, giving Ps and Us. Ps is labeled 1 and Us is labeled 0, and an initial logistic regression classifier is trained. A threshold is then set using the data set S, the whole of U is classified, and the data W that the classifier labels 0 is added to RN. After that, a classifier is trained with Ps and RN and used to classify the remaining U; the samples classified as 0 are added to RN, and the process iterates until the termination condition is met. In short, provided that classification accuracy on the positive examples is preserved, each iteration of the positive and unlabeled learning algorithm can expand the reliable negative sample set.
Positive and unlabeled learning is generally applied to binary classification problems, whereas assigning maternal-infant life stages is a multi-class problem, so the disclosure uses one-vs-rest to convert the multi-class problem into multiple binary classification problems.
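The one-vs-rest conversion can be sketched as below: for each stage, users labeled with that stage become 1 and all other labeled users become 0. The stage names L0 to L4 come from the text; the function names are hypothetical.

```python
STAGES = ["L0", "L1", "L2", "L3", "L4"]

def one_vs_rest_labels(stage_by_user, target):
    """Map each user's stage to 1 if it equals `target`, else 0."""
    return {u: int(s == target) for u, s in stage_by_user.items()}

def all_binary_problems(stage_by_user, stages=STAGES):
    """Build one binary labeling per stage."""
    return {s: one_vs_rest_labels(stage_by_user, s) for s in stages}
```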
The complete positive and unlabeled learning algorithm flow based on the "spy technique" can be summarized as follows:
Algorithm: positive and unlabeled learning algorithm flow
In the above algorithm, the positive and unlabeled learning algorithm has several parameters that need to be set, such as the sampling rate s% and the threshold th. So that the positive samples left for training are not too few while S is still large enough to act as a "spy", the disclosure uses a sampling rate of, for example, 15%. Ideally, the threshold th set for the model produced in each iteration would cause the whole data set S to be classified correctly as positive. Because the data contains noise, however, th is set so that the model's classification accuracy on S lies between, for example, 80% and 100%; the disclosure sets th so that the classification accuracy on S is, for example, 95%.
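One way to realize such a threshold, offered only as a sketch, is to sort the classifier scores of the spy samples S and choose th so that at least the desired fraction of spies scores at or above it. The function name `spy_threshold` and the parameter `keep` are assumptions.

```python
import math

def spy_threshold(spy_scores, keep=0.95):
    """Pick th so that at least `keep` of the spy scores are >= th."""
    ordered = sorted(spy_scores)
    # Number of spies that may fall below th, rounded down to stay conservative.
    n_below = math.floor((1.0 - keep) * len(ordered))
    return ordered[min(n_below, len(ordered) - 1)]
```

With `keep=1.0` the threshold is the lowest spy score, which corresponds to the ideal case in which every spy is classified as positive.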
Fig. 4 shows a flow chart of a method 400 for generating the pregnancy (L0) stage label according to an embodiment of the disclosure. Those skilled in the art will appreciate that the same method also applies to generating the L1 to L4 stage labels.
Method 400 starts at step 401. Then, at step 402, it is determined whether the user data satisfies the automated labeling rules. If so, the positive examples of the user data are obtained at step 403, that is, the labeled data set is obtained. At this point the labeled data set can include user data in each of the stages L0 to L4. If the result of step 402 is no, the corresponding user data forms the unlabeled data (step 404), that is, user data that the automated labeling rules cannot cover.
Next, at step 405, it is determined whether the positive-example user data obtained at step 403 belongs to stage L0, because the subsequent binary classification is carried out in a one-vs-rest manner. If so, the corresponding user data is labeled 1, and a training data set P, a validation data set and a test set can be generated at random in a ratio of 8:1:1. Those skilled in the art will appreciate that the positive-example data can be divided into a training data set, a validation data set and a test set in some ratio in order to produce a more accurate and reliable classifier, and that the ratio is not limited to the one above. The unlabeled data produced at step 404 and the user data judged at step 405 not to belong to stage L0 can be merged, and the unlabeled data set U is drawn from the merged data in some ratio. For example, data can be randomly sampled from the merged data at a ratio of 1:10 relative to the training data set P to produce the unlabeled data set U. That is, the ratio of positive-example user data to unlabeled user data is 1:10.
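The 8:1:1 split and the 1:10 sampling described above can be sketched as follows. The function names, the fixed random seed and the rounding of set sizes are assumptions made for the example.

```python
import random

def split_positive(positives, ratios=(8, 1, 1), seed=0):
    """Shuffle and split into (train, validation, test) by the given ratios."""
    items = list(positives)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def sample_unlabeled(pool, n_train_pos, per_positive=10, seed=0):
    """Draw `per_positive` users from the merged pool for each training positive."""
    pool = list(pool)
    k = min(len(pool), n_train_pos * per_positive)
    return random.Random(seed).sample(pool, k)
```

For 100 positive users this gives 80 training, 10 validation and 10 test users, and an unlabeled set U of 800 users drawn from the merged pool.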
In addition to the positive-example training data set P and the unlabeled data set U, the features of the user data are also needed as input to the positive and unlabeled learning algorithm. Therefore, at step 408, the features of the user data can be extracted. Then, at step 409, a classifier is produced by the positive and unlabeled learning algorithm; for the detailed procedure, refer to the flow described above.
Then, at step 410, the unlabeled data is classified using the classifier produced. If the output at step 411 is 1, the corresponding user data is given the L0 label at step 412. At step 413, method 400 ends.
By repeating a similar method, all user data can be classified into the respective life stages.
Effect assessment
In addition to offline cross-validation and evaluation of the classification model, the disclosure also designs an online ABTest verification mechanism. By verifying label quality and business metrics in the online ABTest, it can verify the reliability of the results of predicting the demographic attributes of e-commerce users.
Fig. 5 shows the design of the ABTest label evaluation. At the traffic end, traffic is divided into 3 sets according to whether an impression hits a labeled user. Set A: the impressions participating in the experiment. Set B: the impressions in A for which the requesting user carries the label L under test. Set C: the impressions in B that are triggered by the targeting label L. Taking the verification of maternal-infant audience labels as an example, the following comparison experiments are designed to measure the value of the L0 targeting label:
Exp-base: sampled by pv (page view); the baseline experiment, representing data that uses the label L0 correctly;
Exp-random: sampled by pv; the random experiment, representing data that uses randomly assigned L0 targeting labels (implemented by randomly selecting users u1 and u2 and exchanging their L0 targeting label data, with everything else unchanged);
Exp-unuse: sampled by pv; the L0 targeting label is not used, and the L0 label data carried by the user is removed manually.
Using the ABTest system, Exp-base is compared with Exp-random and with Exp-unuse in set B, and Exp-base is compared with Exp-random in set C, observing advertising business metrics such as CPM (Cost Per Mille, the cost per thousand impressions) and CTR (Click Through Rate).
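The metrics and the label exchange used in Exp-random can be sketched as below. CTR and CPM follow their standard definitions; the helper names and the `relative_lift` comparison are illustrative assumptions and not part of the disclosure.

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks per impression."""
    return clicks / impressions if impressions else 0.0

def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return 1000.0 * cost / impressions if impressions else 0.0

def swap_labels(labels, u1, u2):
    """Exp-random: exchange the L0 label data of two users, leaving the rest unchanged."""
    swapped = dict(labels)
    swapped[u1], swapped[u2] = labels.get(u2), labels.get(u1)
    return swapped

def relative_lift(base, other):
    """Relative change of a metric in Exp-base against a comparison arm."""
    return (base - other) / other
```

For example, if Exp-base reaches a CTR of 0.03 in set C and Exp-random reaches 0.02, the relative lift of the correctly applied label is 50%.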
Fig. 6 shows a user data classification method 600 according to an embodiment of the disclosure. The method 600 includes: at step 601, generating features of user data; at step 602, generating a labeled data set and an unlabeled data set of the user data according to labeling rules; at step 603, constructing, from the labeled data set and the unlabeled data set, a labeled positive sample data set P and an unknown sample data set U for one category among multiple (for example, more than 2) categories; at step 604, producing a classifier from the labeled positive sample data set P, the unknown sample data set U and the corresponding user data features; and at step 605, using the classifier to determine whether user data in the unlabeled data set belongs to that category.
In one embodiment, the user data can be e-commerce user data, and the multiple categories are multiple life stages.
In one embodiment, the method 600 can also include determining whether the user data satisfies the labeling rules and, if so, adding it to the labeled data set. The labeling rules can include: if the user data indicates that goods of only one life stage have been purchased, the purchase time is taken as the start time of that life stage; if the user data indicates that goods of multiple life stages have been purchased, and purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage; and/or if the user data indicates that goods of multiple life stages have been purchased, but not in chronological order, the earliest life stage prevails and the earliest order time belonging to that life stage is taken as the start time of that life stage. The method can also include determining which life stage the user data currently belongs to from the determined start time of the life stage, the duration of each life stage and the current time.
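The three labeling rules and the stage lookup can be sketched as follows. The stage durations in `DURATION_DAYS` are hypothetical placeholders, since the durations are not given in this passage. Two readings are also assumptions: for a single-stage buyer the earliest purchase is taken as the start, and for in-order purchases the stage of the last purchase starts at the time of that last purchase.

```python
from datetime import datetime, timedelta

STAGES = ["L0", "L1", "L2", "L3", "L4"]
# Hypothetical stage lengths in days, for illustration only.
DURATION_DAYS = {"L0": 280, "L1": 180, "L2": 180, "L3": 730, "L4": 1095}

def stage_start(orders):
    """orders: list of (order_time, stage). Returns (stage, start_time)."""
    orders = sorted(orders)
    stages = [s for _, s in orders]
    ranks = [STAGES.index(s) for s in stages]
    if len(set(stages)) == 1:
        # Rule 1: only one stage purchased; its (earliest) purchase time starts it.
        return stages[0], orders[0][0]
    if ranks == sorted(ranks):
        # Rule 2: stages purchased in chronological order; the last purchase decides.
        return stages[-1], orders[-1][0]
    # Rule 3: out of order; the earliest stage and its earliest order time decide.
    first = STAGES[min(ranks)]
    return first, min(t for t, s in orders if s == first)

def current_stage(stage, start, now):
    """Walk forward through the stage durations from `start` until `now`."""
    i = STAGES.index(stage)
    while i < len(STAGES):
        end = start + timedelta(days=DURATION_DAYS[STAGES[i]])
        if now < end:
            return STAGES[i]
        start, i = end, i + 1
    return None  # the user has passed the last stage
```

A user whose rule-derived start is the beginning of L0 and who is observed a year later would, under these placeholder durations, be assigned to L1.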
In one embodiment, the features can include category features of purchased goods, demographic attribute features and temporal features, and the temporal features can include a purchase-time-weighted feature and features related to each life stage.
In one embodiment, the labeled positive sample data set P can include the user data in the labeled data set that belongs to the category, the unknown sample data set U can include at least a portion of the set formed by the user data in the labeled data set that does not belong to the category and the user data in the unlabeled data set, and producing the classifier can include the following steps:
Set the classifier M to empty and the reliable negative sample set RN to empty;
Randomly sample a portion S of the user data from P and add it to U, updating P and U, denoted Ps = P - S and Us = U + S;
With Ps as positive samples and Us as negative samples, train a logistic regression classifier LR_i, i = 0, 1, ..., as follows:
(1) use S to set the classifier threshold th;
(2) for each sample u ∈ Us: if the result of classifier LR_i is less than the threshold th, add u to RN, and set Us = Us - RN;
(3) M = M + LR_i;
With Ps as positive samples and RN as negative samples, train a logistic regression classifier LR_i and repeat steps (1)-(3) above until the iteration stopping condition is met, obtaining the classifier LR_last;
Use LR_last to classify P; if more than a certain threshold number of positive samples are judged negative, return LR_1 as the final classifier, otherwise return LR_last as the final classifier.
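The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the disclosure's implementation: the logistic regression is a plain gradient-descent version written for self-containment, the stopping condition (no new reliable negatives, or `max_iter` rounds) is assumed, a 0.5 cut-off is assumed when checking how many positives LR_last rejects, and "LR_1" is read as the first classifier trained. All function and parameter names are hypothetical.

```python
import math
import random

def train_lr(X, y, epochs=200, rate=0.5):
    """Batch gradient-descent logistic regression; returns a scoring function."""
    n = len(X)
    w, b = [0.0] * len(X[0]), 0.0

    def prob(x):
        z = b + sum(wj * xj for wj, xj in zip(w, x))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

    for _ in range(epochs):
        errs = [prob(x) - t for x, t in zip(X, y)]
        w = [wj - rate * sum(e * x[j] for e, x in zip(errs, X)) / n
             for j, wj in enumerate(w)]
        b -= rate * sum(errs) / n
    return prob

def pu_learn(P, U, spy_rate=0.15, keep=0.95, max_iter=5, max_lost=0.05, seed=0):
    """Spy-based PU learning; returns (final_classifier, reliable_negatives)."""
    P = list(P)
    random.Random(seed).shuffle(P)
    n_spy = max(1, int(len(P) * spy_rate))
    S, Ps = P[:n_spy], P[n_spy:]            # Ps = P - S
    Us = list(U) + S                         # Us = U + S
    RN, M = [], []                           # reliable negatives, classifiers so far
    negatives = Us                           # first round: Ps against Us
    for _ in range(max_iter):
        lr = train_lr(Ps + negatives, [1] * len(Ps) + [0] * len(negatives))
        spy_scores = sorted(lr(s) for s in S)
        # (1) threshold that keeps at least `keep` of the spies positive
        th = spy_scores[min(len(S) - 1, math.floor((1.0 - keep) * len(S)))]
        found = [u for u in Us if lr(u) < th]    # (2) new reliable negatives
        M.append(lr)                             # (3) M = M + LR_i
        if not found:
            break                                # assumed stopping condition
        RN += found
        Us = [u for u in Us if lr(u) >= th]
        negatives = RN                           # later rounds: Ps against RN
    last = M[-1]
    lost = sum(1 for p in P if last(p) < 0.5)    # positives rejected by LR_last
    return (M[0] if lost > max_lost * len(P) else last), RN

# Toy data: positives near (2, 2); U hides 10 positives among 40 negatives near (-2, -2).
rng = random.Random(1)
def cluster(center, n):
    return [[center + rng.uniform(-0.5, 0.5), center + rng.uniform(-0.5, 0.5)]
            for _ in range(n)]

P = cluster(2.0, 20)
U = cluster(2.0, 10) + cluster(-2.0, 40)
classifier, reliable_negatives = pu_learn(P, U)
```

On the toy data the returned classifier scores a point near the positive cluster higher than a point near the negative cluster, and the reliable negative set is non-empty after the first round. In practice a library implementation of logistic regression would replace `train_lr`.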
Fig. 7 shows a user data classification device 700 according to an embodiment of the disclosure. The user data classification device 700 includes a feature generation unit 701, a labeling unit 702, a sample construction unit 703, a classifier generation unit 704 and a classification unit 705. The feature generation unit 701 is configured to generate features of user data. The labeling unit 702 is configured to generate a labeled data set and an unlabeled data set of the user data according to labeling rules. The sample construction unit 703 is configured to construct, from the labeled data set and the unlabeled data set, a labeled positive sample data set P and an unknown sample data set U for one category among multiple categories. The classifier generation unit 704 is configured to produce a classifier from the labeled positive sample data set P, the unknown sample data set U and the corresponding user data features. The classification unit 705 is configured to use the classifier to determine whether user data in the unlabeled data set belongs to that category.
In one embodiment, the user data can be e-commerce user data, and the multiple categories can be multiple life stages.
In one embodiment, the labeling unit can also be configured to determine whether the user data satisfies the labeling rules and, if so, to add it to the labeled data set. The labeling rules include: if the user data indicates that goods of only one life stage have been purchased, the purchase time is taken as the start time of that life stage; if the user data indicates that goods of multiple life stages have been purchased, and purchased in chronological order, the time of the last purchase determines the start time of the corresponding life stage; and/or if the user data indicates that goods of multiple life stages have been purchased, but not in chronological order, the earliest life stage prevails and the earliest order time belonging to that life stage is taken as the start time of that life stage. The labeling unit can also be configured to determine which life stage the user data currently belongs to from the determined start time of the life stage, the duration of each life stage and the current time.
In one embodiment, the features can include category features of purchased goods, demographic attribute features and temporal features, wherein the temporal features can also include a purchase-time-weighted feature and features related to each life stage.
In one embodiment, the labeled positive sample data set P can include the user data in the labeled data set that belongs to the category, the unknown sample data set U can include at least a portion of the set formed by the user data in the labeled data set that does not belong to the category and the user data in the unlabeled data set, and the classifier generation unit can also be configured to:
Set the classifier M to empty and the reliable negative sample set RN to empty;
Randomly sample a portion S of the user data from P and add it to U, updating P and U, denoted Ps = P - S and Us = U + S;
With Ps as positive samples and Us as negative samples, train a logistic regression classifier LR_i, i = 0, 1, ..., as follows:
(1) use S to set the classifier threshold th;
(2) for each sample u ∈ Us: if the result of classifier LR_i is less than the threshold th, add u to RN, and set Us = Us - RN;
(3) M = M + LR_i;
With Ps as positive samples and RN as negative samples, train a logistic regression classifier LR_i and repeat steps (1)-(3) above until the iteration stopping condition is met, obtaining the classifier LR_last;
Use LR_last to classify P; if more than a certain threshold number of positive samples are judged negative, return LR_1 as the final classifier, otherwise return LR_last as the final classifier.
Fig. 8 shows an exemplary system architecture 800 to which the user data classification method or the user data classification device of the disclosure can be applied.
As shown in Fig. 8, the system architecture 800 can include terminal devices 801, 802 and 803, a network 804 and a server 805. The network 804 is the medium that provides communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 can include various connection types, such as wired or wireless communication links or fiber optic cables.
A user can use the terminal devices 801, 802, 803 to interact with the server 805 through the network 804, to receive or send messages and so on. Various communication client applications can be installed on the terminal devices 801, 802, 803, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (examples only).
The terminal devices 801, 802, 803 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.
The server 805 can be a server that provides various services, for example a back-end management server (example only) that supports shopping websites browsed by users with the terminal devices 801, 802, 803. The back-end management server can analyze and otherwise process received data such as information query requests, and feed the processing results (such as targeted push information or product information, examples only) back to the terminal devices.
It should be noted that the user data classification method provided by the embodiments of the present application is generally executed by the server 805, and accordingly the user data classification device is generally arranged in the server 805.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 8 are merely illustrative. There can be any number of terminal devices, networks and servers, as required by the implementation.
Referring now to Fig. 9, it shows a schematic structural diagram of a computer system 900 suitable for implementing the embodiments of the disclosure. The computer system shown in Fig. 9 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the disclosure.
As shown in Fig. 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores the various programs and data required for the operation of the system 900. The CPU 901, the ROM 902 and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse and the like; an output section 907 including, for example, a cathode ray tube (CRT) or liquid crystal display (LCD) and a loudspeaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.
In particular, according to embodiments of the disclosure, the processes described above with reference to the flow charts can be implemented as computer software programs. For example, embodiments of the disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flow charts. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 909 and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the system of the disclosure are performed.
It should be noted that the computer-readable medium described in the present application can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media can include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus or device. Also in the present application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
The flow charts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in a flow chart or block diagram can represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks can occur in an order different from that noted in the drawings. For example, two blocks shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flow chart, and combinations of blocks in a block diagram or flow chart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Claims (12)
1. A user data classification method, comprising:
generating features of user data;
generating a labeled data set and an unlabeled data set of the user data according to labeling rules;
constructing, from the labeled data set and the unlabeled data set, a labeled positive sample data set P and an unknown sample data set U for one category among multiple categories;
producing a classifier from the labeled positive sample data set P, the unknown sample data set U and the corresponding user data features;
using the classifier to determine whether user data in the unlabeled data set belongs to that category.
2. The method according to claim 1, wherein the user data is e-commerce user data and the multiple categories are multiple life stages.
3. The method according to claim 2, further comprising determining whether the user data satisfies the labeling rules and, if so, adding it to the labeled data set, the labeling rules including:
if the user data indicates that goods of only one life stage have been purchased, taking the purchase time as the start time of that life stage,
if the user data indicates that goods of multiple life stages have been purchased, and purchased in chronological order, determining the start time of the corresponding life stage from the time of the last purchase, and/or
if the user data indicates that goods of multiple life stages have been purchased, but not in chronological order, letting the earliest life stage prevail and taking the earliest order time belonging to that life stage as the start time of that life stage;
wherein the method further comprises determining which life stage the user data currently belongs to from the determined start time of the life stage, the duration of each life stage and the current time.
4. The method according to claim 2, wherein the features include category features of purchased goods, demographic attribute features and temporal features, and wherein the temporal features include a purchase-time-weighted feature and features related to each life stage.
5. The method according to claim 1, wherein the labeled positive sample data set P includes the user data in the labeled data set that belongs to the category, the unknown sample data set U includes at least a portion of the set formed by the user data in the labeled data set that does not belong to the category and the user data in the unlabeled data set, and producing the classifier comprises the following steps:
setting the classifier M to empty and the reliable negative sample set RN to empty;
randomly sampling a portion S of the user data from P and adding it to U, updating P and U, denoted Ps = P - S and Us = U + S;
with Ps as positive samples and Us as negative samples, training a logistic regression classifier LR_i, i = 0, 1, ..., as follows:
(1) using S to set the classifier threshold th;
(2) for each sample u ∈ Us: if the result of classifier LR_i is less than the threshold th, adding u to RN, and setting Us = Us - RN;
(3) M = M + LR_i;
with Ps as positive samples and RN as negative samples, training a logistic regression classifier LR_i and repeating steps (1)-(3) above until the iteration stopping condition is met, obtaining the classifier LR_last;
using LR_last to classify P and, if more than a certain threshold number of positive samples are judged negative, returning LR_1 as the final classifier, otherwise returning LR_last as the final classifier.
6. A user data classification device, comprising:
a feature generation unit configured to generate features of user data;
a labeling unit configured to generate a labeled data set and an unlabeled data set of the user data according to labeling rules;
a sample construction unit configured to construct, from the labeled data set and the unlabeled data set, a labeled positive sample data set P and an unknown sample data set U for one category among multiple categories;
a classifier generation unit configured to produce a classifier from the labeled positive sample data set P, the unknown sample data set U and the corresponding user data features;
a classification unit configured to use the classifier to determine whether user data in the unlabeled data set belongs to that category.
7. The device according to claim 6, wherein the user data is e-commerce user data and the multiple categories are multiple life stages.
8. The device according to claim 7, wherein the labeling unit is further configured to determine whether the user data satisfies the labeling rules and, if so, to add it to the labeled data set, the labeling rules including:
if the user data indicates that goods of only one life stage have been purchased, taking the purchase time as the start time of that life stage,
if the user data indicates that goods of multiple life stages have been purchased, and purchased in chronological order, determining the start time of the corresponding life stage from the time of the last purchase, and/or
if the user data indicates that goods of multiple life stages have been purchased, but not in chronological order, letting the earliest life stage prevail and taking the earliest order time belonging to that life stage as the start time of that life stage;
wherein the labeling unit is further configured to determine which life stage the user data currently belongs to from the determined start time of the life stage, the duration of each life stage and the current time.
9. The device according to claim 7, wherein the features include category features of purchased goods, demographic attribute features and temporal features, and wherein the temporal features include a purchase-time-weighted feature and features related to each life stage.
10. The device according to claim 6, wherein the labeled positive sample data set P includes the user data in the labeled data set that belongs to the category, the unknown sample data set U includes at least a portion of the set formed by the user data in the labeled data set that does not belong to the category and the user data in the unlabeled data set, and the classifier generation unit is further configured to:
set the classifier M to empty and the reliable negative sample set RN to empty;
randomly sample a portion S of the user data from P and add it to U, updating P and U, denoted Ps = P - S and Us = U + S;
with Ps as positive samples and Us as negative samples, train a logistic regression classifier LR_i, i = 0, 1, ..., as follows:
(1) use S to set the classifier threshold th;
(2) for each sample u ∈ Us: if the result of classifier LR_i is less than the threshold th, add u to RN, and set Us = Us - RN;
(3) M = M + LR_i;
with Ps as positive samples and RN as negative samples, train a logistic regression classifier LR_i and repeat steps (1)-(3) above until the iteration stopping condition is met, obtaining the classifier LR_last;
use LR_last to classify P and, if more than a certain threshold number of positive samples are judged negative, return LR_1 as the final classifier, otherwise return LR_last as the final classifier.
11. A server, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 5.
12. A computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710401985.2A CN107273454B (en) | 2017-05-31 | 2017-05-31 | User data classification method, device, server and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710401985.2A CN107273454B (en) | 2017-05-31 | 2017-05-31 | User data classification method, device, server and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107273454A true CN107273454A (en) | 2017-10-20 |
CN107273454B CN107273454B (en) | 2020-11-03 |
Family
ID=60065763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710401985.2A Active CN107273454B (en) | 2017-05-31 | 2017-05-31 | User data classification method, device, server and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107273454B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080147485A1 (en) * | 2006-12-14 | 2008-06-19 | International Business Machines Corporation | Customer Segment Estimation Apparatus |
CN104090888A (en) * | 2013-12-10 | 2014-10-08 | 深圳市腾讯计算机系统有限公司 | Method and device for analyzing user behavior data |
WO2015094281A1 (en) * | 2013-12-19 | 2015-06-25 | Hewlett-Packard Development Company, L.P. | Residual data identification |
CN106127525A (en) * | 2016-06-27 | 2016-11-16 | 浙江大学 | TV shopping commodity recommendation method based on sorting algorithm |
CN106202177A (en) * | 2016-06-27 | 2016-12-07 | 腾讯科技(深圳)有限公司 | File classification method and device |
Non-Patent Citations (2)
Title |
---|
BING LIU ET AL: "Building Text Classifiers Using Positive and Unlabeled Examples", 《THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING》 * |
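SUN Jingyu et al.: "An improved IBCF algorithm distinguishing users' long-term and short-term interests", 《Journal of Zhengzhou University (Natural Science Edition)》 * |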
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801091B (en) * | 2017-11-16 | 2022-12-20 | 腾讯科技(深圳)有限公司 | Target user group positioning method and device, computer equipment and storage medium |
CN109801091A (en) * | 2017-11-16 | 2019-05-24 | 腾讯科技(深圳)有限公司 | Target user group positioning method and device, computer equipment and storage medium |
CN109840788B (en) * | 2017-11-27 | 2021-11-02 | 北京京东尚科信息技术有限公司 | Method and device for analyzing user behavior data |
CN109840788A (en) * | 2017-11-27 | 2019-06-04 | 北京京东尚科信息技术有限公司 | Method and device for analyzing user behavior data |
WO2019114481A1 (en) * | 2017-12-13 | 2019-06-20 | 腾讯科技(深圳)有限公司 | Cluster type recognition method, apparatus, electronic apparatus, and storage medium |
CN110392899B (en) * | 2017-12-18 | 2023-09-15 | 甲骨文国际公司 | Dynamic feature selection for model generation |
CN110392899A (en) * | 2017-12-18 | 2019-10-29 | 甲骨文国际公司 | Dynamic feature selection for model generation |
CN109961308A (en) * | 2017-12-25 | 2019-07-02 | 北京京东尚科信息技术有限公司 | Method and apparatus for evaluating tag data |
CN108256907A (en) * | 2018-01-09 | 2018-07-06 | 北京腾云天下科技有限公司 | Construction method and computing device for a customer grouping model |
CN108364192B (en) * | 2018-01-16 | 2022-10-18 | 创新先进技术有限公司 | User mining method and device and electronic equipment |
CN108364192A (en) * | 2018-01-16 | 2018-08-03 | 阿里巴巴集团控股有限公司 | User mining method and device and electronic equipment |
CN108305099A (en) * | 2018-01-18 | 2018-07-20 | 阿里巴巴集团控股有限公司 | Method and device for determining purchasing user |
CN108305099B (en) * | 2018-01-18 | 2021-11-19 | 创新先进技术有限公司 | Method and device for determining purchasing user |
CN108399418B (en) * | 2018-01-23 | 2021-09-03 | 北京奇艺世纪科技有限公司 | User classification method and device |
CN108399418A (en) * | 2018-01-23 | 2018-08-14 | 北京奇艺世纪科技有限公司 | User classification method and device |
CN110706049A (en) * | 2018-07-10 | 2020-01-17 | 北京京东尚科信息技术有限公司 | Data processing method and device |
CN109191167A (en) * | 2018-07-17 | 2019-01-11 | 阿里巴巴集团控股有限公司 | Target user mining method and device |
CN110428295A (en) * | 2018-08-01 | 2019-11-08 | 北京京东尚科信息技术有限公司 | Method of Commodity Recommendation and system |
CN110826579A (en) * | 2018-08-07 | 2020-02-21 | 北京京东尚科信息技术有限公司 | Commodity classification method and device |
CN109087145A (en) * | 2018-08-13 | 2018-12-25 | 阿里巴巴集团控股有限公司 | Target group mining method, device, server and readable storage medium |
CN109325525A (en) * | 2018-08-31 | 2019-02-12 | 阿里巴巴集团控股有限公司 | Sample attribute assessment models training method, device and server |
CN111225009A (en) * | 2018-11-27 | 2020-06-02 | 北京沃东天骏信息技术有限公司 | Method and apparatus for generating information |
CN111340053A (en) * | 2018-12-03 | 2020-06-26 | 北京嘀嘀无限科技发展有限公司 | Order classification method, classification system, computer device and readable storage medium |
CN111325228A (en) * | 2018-12-17 | 2020-06-23 | 上海游昆信息技术有限公司 | Model training method and device |
CN111325228B (en) * | 2018-12-17 | 2021-04-06 | 上海游昆信息技术有限公司 | Model training method and device |
CN109948730A (en) * | 2019-03-29 | 2019-06-28 | 中诚信征信有限公司 | Data classification method, device, electronic equipment and storage medium |
CN110322281B (en) * | 2019-06-06 | 2023-10-27 | 创新先进技术有限公司 | Similar user mining method and device |
CN110322281A (en) * | 2019-06-06 | 2019-10-11 | 阿里巴巴集团控股有限公司 | Similar user mining method and device |
CN110458641A (en) * | 2019-06-28 | 2019-11-15 | 苏宁云计算有限公司 | E-commerce recommendation method and system |
CN110458641B (en) * | 2019-06-28 | 2022-02-25 | 苏宁云计算有限公司 | E-commerce recommendation method and system |
CN110597984A (en) * | 2019-08-12 | 2019-12-20 | 大箴(杭州)科技有限公司 | Method and device for determining abnormal behavior user information, storage medium and terminal |
CN110796482A (en) * | 2019-09-27 | 2020-02-14 | 北京淇瑀信息科技有限公司 | Financial data classification method and device for machine learning model and electronic equipment |
CN110796171A (en) * | 2019-09-27 | 2020-02-14 | 北京淇瑀信息科技有限公司 | Unclassified sample processing method and device of machine learning model and electronic equipment |
CN112580681A (en) * | 2019-09-30 | 2021-03-30 | 北京星选科技有限公司 | User classification method and device, electronic equipment and readable storage medium |
CN110807546A (en) * | 2019-10-22 | 2020-02-18 | 恒大智慧科技有限公司 | Community grid population change early warning method and system |
CN110807547A (en) * | 2019-10-22 | 2020-02-18 | 恒大智慧科技有限公司 | Method and system for predicting family population structure |
CN111401962A (en) * | 2020-03-20 | 2020-07-10 | 上海络昕信息科技有限公司 | Key opinion consumer mining method, device, equipment and medium |
CN111612519A (en) * | 2020-04-13 | 2020-09-01 | 广发证券股份有限公司 | Method, device and storage medium for identifying potential customers of financial product |
CN111612519B (en) * | 2020-04-13 | 2023-11-21 | 广发证券股份有限公司 | Method, device and storage medium for identifying potential customers of financial products |
CN112800109A (en) * | 2021-01-21 | 2021-05-14 | 蜜兔(杭州)网络科技有限公司 | Information mining method and system |
CN113313561A (en) * | 2021-07-29 | 2021-08-27 | 全屋优品科技(深圳)有限公司 | Transaction management method and system for home soft package supply chain |
CN114268559A (en) * | 2021-12-27 | 2022-04-01 | 天翼物联科技有限公司 | Directional network detection method, device, equipment and medium based on TF-IDF algorithm |
CN114268559B (en) * | 2021-12-27 | 2024-02-20 | 天翼物联科技有限公司 | Directional network detection method, device, equipment and medium based on TF-IDF algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN107273454B (en) | 2020-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107273454A (en) | User data sorting technique, device, server and computer-readable recording medium | |
Bi et al. | Wisdom of crowds: Conducting importance-performance analysis (IPA) through online reviews | |
CN107016026B (en) | User tag determination method, information push method, user tag determination device, information push device | |
JP6693502B2 (en) | Information processing apparatus, information processing method, and program | |
CN107346496B (en) | Target user orientation method and device | |
CN109542916A (en) | Platform commodity enter method, apparatus, computer equipment and storage medium | |
CN108734184B (en) | Method and device for analyzing sensitive image | |
Trépanier et al. | Are transit users loyal? Revelations from a hazard model based on smart card data | |
CN108205766A (en) | Information-pushing method, apparatus and system | |
CN109933699A (en) | A kind of construction method and device of academic portrait model | |
CN108804704A (en) | A kind of user's depth portrait method and device | |
CN109636430A (en) | Object identifying method and its system | |
CN103177129B (en) | Internet real-time information recommendation prognoses system | |
CN111178954A (en) | Advertisement putting method and system and electronic equipment | |
CN111311316B (en) | Method and device for depicting merchant portrait, electronic equipment, verification method and system | |
CN110009379A (en) | A kind of building of site selection model and site selecting method, device and equipment | |
CN107077498A (en) | The presentation-entity relation in online advertisement | |
WO2020150611A1 (en) | Systems and methods for entity performance and risk scoring | |
CN112070310A (en) | Loss user prediction method and device based on artificial intelligence and electronic equipment | |
CN109993544A (en) | Data processing method, system, computer system and computer readable storage medium | |
Chapleau et al. | Strict and deep comparison of revealed transit trip structure between computer-assisted telephone interview household travel survey and smart cards | |
Hatim et al. | E-FoodCart: An Online Food Ordering Service | |
Comber et al. | Building hierarchies of retail centers using Bayesian multilevel models | |
CN116628349A (en) | Information recommendation method, device, equipment, storage medium and program product | |
CN107644101A (en) | Information classification approach and device, information classification equipment and computer-readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||