CN108038739A - Method and system for determining extended users according to a statistical degree of association - Google Patents
- Publication number
- CN108038739A, CN201711446826.0A
- Authority
- CN
- China
- Prior art keywords
- user
- users
- degree
- score value
- behavior
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0251—Targeted advertisements
- G06Q30/0254—Targeted advertisements based on statistics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0251—Targeted advertisements
- G06Q30/0255—Targeted advertisements based on user history
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0277—Online advertisement
Landscapes
- Business, Economics & Management (AREA)
- Finance (AREA)
- Strategic Management (AREA)
- Engineering & Computer Science (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Physics & Mathematics (AREA)
- Game Theory and Decision Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Economics (AREA)
- Marketing (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and system for determining extended users according to a statistical degree of association. The customer demand that is input is scheduled to obtain a positive sample set containing basic users and a negative sample set made up of users entirely unrelated to the basic users. A model is trained on the user features of the users in the positive and negative sample sets to obtain a computation rule. According to that rule, a degree-of-association score relative to the users in the positive sample set is calculated for each user across the whole network, one by one, and the extended users are obtained from the scores. The customer obtains audience data that matches its actual needs with high precision, so that the differing needs of customers can be met.
Description
Technical field
The present invention relates to the field of Internet technology, and more particularly to a method and system for determining extended users according to a statistical degree of association.
Background art
In the field of Internet advertising, a merchant that delivers advertisements indiscriminately to large audiences faces costs too high to bear. How to select a suitable audience from a large number of Internet users, and then determine from the attributes of each audience the people to whom advertisements should be delivered, is a problem that the Internet advertising market urgently needs to solve.
At present, using audience targeting to provide advertisers with more valuable audiences is an important step in Internet advertising. Audience targeting analyzes user feature data to find the behavioral features that potential target users share with a seed audience and uses a machine learning model to predict the target audience, helping advertisers find the target group they are looking for. The seed audience data involved is at most on the order of millions of users, while the non-seed audience data is on the order of hundreds of millions. This disparity causes considerable waste of memory when the machine learning model is trained and increases the memory and time overhead of model training and prediction.
Meanwhile, some machine learning tools require plaintext features to be encoded before model training and prediction can be performed. For example, if there are currently 10 million distinct features, they must be encoded with the integers 1 to 10 million; the feature "visited sports.sina.com.cn" might be encoded as 11 and the feature "searched for tourism" as 999.
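The encoding of plaintext features described above can be sketched as follows. This is a minimal single-machine illustration only; the function name build_feature_index and the choice to assign codes in order of first appearance are assumptions and are not taken from the specification.

```python
def build_feature_index(features):
    """Map each distinct plaintext feature to an integer code starting at 1.

    Codes are assigned in order of first appearance (an assumption made
    for this sketch; the specification only requires codes 1..N).
    """
    index = {}
    for feature in features:
        if feature not in index:
            index[feature] = len(index) + 1
    return index


observed = [
    "visited sports.sina.com.cn",
    "searched for tourism",
    "visited sports.sina.com.cn",  # repeated feature keeps its first code
]
feature_index = build_feature_index(observed)
encoded = [feature_index[f] for f in observed]
```

A dictionary lookup keeps each code stable across repeated occurrences, which is what a downstream training tool would rely on.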
The traditional scheme uses single-machine feature encoding: a single machine traverses the file that stores the features and encodes them one by one. This scheme has the following two shortcomings:
1) If the feature file is very large, for example one containing an extremely large number of features, the scheme runs slowly;
2) If the feature file is originally stored on HDFS and the encoded feature file must also be stored on HDFS, the data must first be downloaded from HDFS and the encoded feature file then uploaded back to HDFS, which creates extra development and maintenance work.
Summary of the invention
To solve the above problems, a method and system for determining extended users according to a statistical degree of association are provided.
According to one aspect of the present invention, there is provided a method for determining extended users according to a statistical degree of association, the method comprising:
obtaining statistical data associated with the network behavior of all users in a data network, and performing feature extraction on the statistical data to determine the user features of all users;
receiving an extension request for similar-user extension of basic users, and parsing the extension request to determine a set quantity of extended users and a positive sample set comprising a plurality of basic users;
determining, from all users in the data network and the plurality of basic users, a negative sample set comprising a plurality of training users, wherein the ratio of the number of basic users to the number of training users is less than or equal to a predetermined threshold;
performing feature analysis on the user features of the plurality of training users in the negative sample set to determine a computation rule for calculating the degree of association of each user;
calculating, based on the computation rule, the degree-of-association score of each user among all users, and sorting all users in descending order of score to generate a user list; and
determining as extended users the set quantity of users with the highest degree-of-association scores in the user list from which the plurality of basic users have been removed.
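The last two steps above (sorting by score, removing the basic users and keeping the set quantity of top-scoring users) can be sketched as follows. The function name select_extended_users, the dictionary representation of scores and the tie-break on user identifier are assumptions made for illustration.

```python
def select_extended_users(scores, basic_users, quantity, exclude_basic=True):
    """Return the `quantity` users with the highest degree-of-association scores.

    scores:        dict mapping user id -> score (hypothetical representation)
    basic_users:   set of seed user ids
    exclude_basic: if True, basic users are removed from the ranked list first
    """
    # Descending by score; ties broken by user id so the result is deterministic.
    ranked = sorted(scores, key=lambda user: (-scores[user], user))
    if exclude_basic:
        ranked = [user for user in ranked if user not in basic_users]
    return ranked[:quantity]


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
without_seeds = select_extended_users(scores, {"a"}, 2)
with_seeds = select_extended_users(scores, {"a"}, 2, exclude_basic=False)
```

The exclude_basic flag covers both variants in the text: the list with the basic users removed, and the list in which they are kept.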
Preferably, the method further comprises: determining as extended users the set quantity of users with the highest degree-of-association scores in the user list from which the plurality of basic users have not been removed.
Preferably, the user features of all users are extracted from statistics on the offline network behavior data of all users of the data network.
Preferably, the network behavior of a user comprises: search click behavior, web browsing behavior and/or behavior obtained through third-party cooperation.
Preferably, the user features comprise: the user's host features, n-gram features, online time periods, the region from which the user goes online and/or product browsing behavior.
Preferably, the positive sample set of basic users is determined from the set of user attributes selected by the customer and the number of basic users input by the customer.
Preferably, the users among all users in the data network whose user features have a low degree-of-association score relative to the plurality of basic users are classified into the negative sample set comprising a plurality of training users.
Preferably, dirty sample data with obviously abnormal user features is filtered out of all users in the data network to obtain the negative sample set.
Preferably, negative sampling is performed on the user features of all users in the data network, and the number of training users in the negative sample set is obtained from the set threshold and the number of basic users.
Preferably, the user features of the training users in the negative sample set and of the basic users in the positive sample set are extracted separately, and their association is compared to derive the computation rule.
Preferably, according to the computation rule, the user features of the users in the data network are compared and calculated one by one, and each user is assigned a degree-of-association score relative to the basic users according to the result.
Preferably, the users in the data network are sorted by degree-of-association score, and the sorted result is adjusted according to user attributes.
Preferably, negative sampling is performed on all users in the data network to obtain the negative sample set, and positive sampling is performed on the basic users to obtain the positive sample set; the sampling coefficients of the negative sampling and the positive sampling are set as needed.
Preferably, the sampling coefficients of the negative sampling and the positive sampling are set according to the actually required number of basic users in the positive sample set and number of training users in the negative sample set.
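The relation between the threshold, the number of basic users and the size of the negative sample set can be sketched as follows. If the ratio of basic users to training users must not exceed a threshold t, then at least ceil(n_basic / t) training users are needed. The function names, the use of uniform random sampling and the use of Fraction to avoid floating-point rounding are assumptions of this sketch; the specification does not fix the sampling procedure.

```python
import math
import random
from fractions import Fraction


def min_training_users(n_basic, max_ratio):
    """Smallest training-user count with n_basic / count <= max_ratio."""
    return math.ceil(Fraction(n_basic) / Fraction(max_ratio))


def sample_training_users(all_users, basic_users, max_ratio, seed=None):
    """Draw a negative sample set from the users that are not basic users.

    Uniform random sampling is an assumption made for illustration.
    """
    candidates = sorted(user for user in all_users if user not in basic_users)
    count = min_training_users(len(basic_users), max_ratio)
    if count > len(candidates):
        raise ValueError("not enough non-basic users to satisfy the ratio")
    return random.Random(seed).sample(candidates, count)


users = [f"u{i}" for i in range(100)]
basic = {"u0", "u1"}
negatives = sample_training_users(users, basic, Fraction(1, 10), seed=0)
```

With 2 basic users and a threshold of 1/10, the sketch draws 20 training users, so the ratio sits exactly at the threshold.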
According to another aspect of the present invention, there is provided a system for determining extended users according to a statistical degree of association, the system comprising:
a user feature unit, configured to obtain statistical data associated with the network behavior of all users in a data network, and to perform feature extraction on the statistical data to determine the user features of all users;
a positive sample set unit, configured to receive an extension request for similar-user extension of basic users, and to parse the extension request to determine a set quantity of extended users and a positive sample set comprising a plurality of basic users;
a negative sample set unit, configured to determine, from all users in the data network and the plurality of basic users, a negative sample set comprising a plurality of training users, wherein the ratio of the number of basic users to the number of training users is less than or equal to a predetermined threshold;
a computation rule unit, configured to perform feature analysis on the user features of the plurality of training users in the negative sample set to determine a computation rule for calculating the degree of association of each user;
a degree-of-association calculation unit, configured to calculate, based on the computation rule, the degree-of-association score of each user among all users, and to sort all users in descending order of score to generate a user list;
an extended user unit, configured to determine as extended users the set quantity of users with the highest degree-of-association scores in the user list from which the plurality of basic users have been removed.
Preferably, the system is further configured to determine as extended users the set quantity of users with the highest degree-of-association scores in the user list from which the plurality of basic users have not been removed.
Preferably, the user features of all users are extracted from statistics on the offline network behavior data of all users of the data network.
Preferably, the network behavior of a user comprises: search click behavior, web browsing behavior and/or behavior obtained through third-party cooperation.
Preferably, the user features comprise: the user's host features, n-gram features, online time periods, the region from which the user goes online and/or product browsing behavior.
Preferably, the positive sample set of basic users is determined from the set of user attributes selected by the customer and the number of basic users input by the customer.
Preferably, the users among all users in the data network whose user features have a low degree-of-association score relative to the plurality of basic users are classified into the negative sample set comprising a plurality of training users.
Preferably, dirty sample data with obviously abnormal user features is filtered out of all users in the data network to obtain the negative sample set.
Preferably, negative sampling is performed on the user features of all users in the data network, and the number of training users in the negative sample set is obtained from the set threshold and the number of basic users.
Preferably, the user features of the training users in the negative sample set and of the basic users in the positive sample set are extracted separately, and their association is compared to derive the computation rule.
Preferably, according to the computation rule, the user features of the users in the data network are compared and calculated one by one, and each user is assigned a degree-of-association score relative to the basic users according to the result.
Preferably, the users in the data network are sorted by degree-of-association score, and the sorted result is adjusted according to user attributes.
Preferably, negative sampling is performed on all users in the data network to obtain the negative sample set, and positive sampling is performed on the basic users to obtain the positive sample set; the sampling coefficients of the negative sampling and the positive sampling are set as needed.
Preferably, the sampling coefficients of the negative sampling and the positive sampling are set according to the actually required number of basic users in the positive sample set and number of training users in the negative sample set.
According to yet another aspect of the present invention, there is provided a mobile terminal comprising, or configured to execute, the system of any one of the above.
In the solutions provided by the embodiments of the present invention, the customer demand that is input is scheduled to obtain a positive sample set containing basic users and a negative sample set made up of users entirely unrelated to the basic users; a model is trained on the user features of the users in the positive and negative sample sets to obtain a computation rule; a degree-of-association score relative to the users in the positive sample set is calculated for each user across the whole network, one by one, according to the computation rule; and the extended users are obtained from the scores. The customer obtains audience data matching its actual needs with high precision, so that the differing needs of customers can be met.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, identical components are denoted by identical reference numerals. In the drawings:
Fig. 1 is a schematic structural diagram of the DMP Look-alike online system provided by an embodiment of the present invention.
Fig. 2 is a flowchart of the method provided by an embodiment of the present invention for determining extended users according to a statistical degree of association.
Fig. 3 is a schematic structural diagram of the system provided by an embodiment of the present invention for determining extended users according to a statistical degree of association.
Fig. 4 is a flowchart of the method provided by an embodiment of the present invention for determining extended users according to weighted calculation.
Fig. 5 is a schematic structural diagram of the system provided by an embodiment of the present invention for determining extended users according to weighted calculation.
Fig. 6 is a flowchart of the method provided by an embodiment of the present invention for determining extended users according to a statistical degree of interest.
Fig. 7 is a schematic structural diagram of the system provided by an embodiment of the present invention for determining extended users according to a statistical degree of interest.
Fig. 8 is a flowchart of the method provided by an embodiment of the present invention for distributed encoding of user features.
Fig. 9 is a schematic structural diagram of the system provided by an embodiment of the present invention for distributed encoding of user features.
Detailed description of the embodiments
Exemplary embodiments of the present invention are now described with reference to the drawings. The invention may, however, be implemented in many different forms and is not limited to the embodiments described here; these embodiments are provided so that the invention is disclosed thoroughly and completely and its scope is fully conveyed to those of ordinary skill in the art. The terms used in the exemplary embodiments illustrated in the drawings do not limit the invention. In the drawings, identical units or elements use identical reference numerals.
Unless otherwise stated, the terms used here (including scientific and technical terms) have the meanings commonly understood by those of ordinary skill in the art. It will further be understood that terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their context in the relevant field, and should not be interpreted in an idealized or overly formal sense.
The embodiments of the present invention are based on the DMP Look-alike online system shown in Fig. 1, in which:
Offline flow: based on a distributed computing platform (Hadoop + Spark), user features (for example host features, n-gram features, online time periods, the region from which the user goes online, and products browsed) are extracted from the network behavior of users across the whole network (for example search click behavior, web browsing behavior, and behavior obtained through cooperation with third-party companies).
Online flow: based on the whole-network user features computed by the offline flow and the seed audience delivered by the scheduler through task scheduling, an appropriate machine learning model (for example a classification model in supervised learning or a clustering model in unsupervised learning) is used, through model training and prediction, to find a similar target audience.
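Two of the feature types named for the offline flow can be sketched as follows. The specification does not define how host features or n-gram features are computed, so both helpers are assumptions: host_feature takes the host name from a browsed URL, and word_ngrams builds word-level n-grams from a search query. The sketch runs on a single machine and does not model the Hadoop + Spark platform.

```python
from urllib.parse import urlparse


def host_feature(url):
    """Return the host name of a browsed URL (None if the URL has no scheme/host)."""
    return urlparse(url).hostname


def word_ngrams(text, n=2):
    """Return the word-level n-grams of a search query, lower-cased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


host = host_feature("http://sports.sina.com.cn/nba/")
bigrams = word_ngrams("cheap flights to Beijing", n=2)
```

Character-level n-grams would be an equally plausible reading of "n-gram features"; the word-level variant is chosen here only to keep the example short.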
Advertisers' demand is large and computing resources are limited, which is a constant tension. To safeguard the interests of every advertiser from a global perspective, the scheduler module is therefore extremely important. It schedules look-alike task requests by weighing a range of factors (for example: how many look-alike tasks an advertiser has submitted, the advertiser's spending capacity on the DSP delivery side, the advertiser's special needs (for example Double 11), and whether multiple expansion multiples are applied to the same seed audience). Once a look-alike task has been scheduled, the online flow is started to compute its similar target audience.
The behavioural information possessed, this just needs huge data storage as analysis source.Data management platform (DMP, Data-
Management Platform) be crowd's growth data analysis method basis.Crowd's growth data analysis method development company
Can be based on a large number of users itself covered, collection and depth excavate the number greatly of corresponding behavioral chain on the premise of individual privacy is protected
According to the search of such as user is clicked on, browses webpage network behavior data.In general, client, i.e. advertiser will be apparent that oneself
The product of advertisement and behind is wanted to touch the user group reached, for example App advertiser can be accurately grasped in oneself App product
Any active ues IMEI or IFA, electric business website advertiser have the cookie or cell-phone number, O2O of user interested in certain commodity
Advertiser might have telephone number of client etc..Therefore, the first party data that client has by oneself can also be obtained, as official website visitor,
Place an order, pay close attention to the data such as wechat, concern microblogging and installation mobile application client.It can also be obtained outer by cooperating with third party
The labeling data of portion affiliate, for example, user access website, using APP, watch video, place an order shopping and connection hot spot etc.
Network behavior data.Thus the network behavior off-line data of the user obtained becomes the integration of DMP emphasis after anonymous desensibilization
Critical data.
When users' offline network behavior data is obtained, the offline network behavior data of users across the whole network may also be obtained from users' network behavior logs on the distributed computing platform (Hadoop + Spark). The log data provides good data support for research such as user interest discovery and resource recommendation, and makes it easier to extract user feature data from the offline network behavior data obtained.
Through long-term data collection and segmentation, user feature data can be extracted from users' offline network behavior data, for example the user's host features, n-gram features, online time periods, the region from which the user goes online and/or product browsing behavior. Users are managed in fine-grained segments, and insight into and analysis of each type of user is achieved along dimensions such as behavior trajectory, interest preference, consumption behavior and geographic location. The feature data matching each type of user is obtained and stored; the disk storage consumed by the user feature data produced each day exceeds the TB order of magnitude. The extracted user feature data can be freely combined to define a target audience, so that display, search, brand and app-download advertisements can be directed quickly and accurately at an audience with a given class of features.
Customers' demand is large and computing resources are limited, which is a constant tension. To safeguard the interests of every customer from a global perspective and to fit each customer's individual priorities, many parameters must be considered when scheduling look-alike task requests. Specifically, the preset scheduling parameters include the number of look-alike tasks the customer has submitted, the customer's spending capacity on the DSP delivery side, the customer's special needs and/or whether the customer applies multiple expansion multiples to the same seed audience. Once a look-alike task has been scheduled and the seed audience data obtained, the target audience similar to the seed audience can be computed from the seed audience data and the user feature data, finding more similar target users and widening the coverage of precision marketing. In addition, the customer's own first-party data, such as data on official-website visitors, orders placed, WeChat followers, Weibo followers and mobile app installations, can be used as seed audience data, so that the accurate customer data the customer already holds is put to use while the customer's need for customization is also met.
Based on its own needs, the customer can independently determine scheduling parameters such as the number of look-alike tasks, its spending capacity on the DSP delivery side, and special needs such as Double 11 or holiday promotions. It can also independently determine the order of magnitude of users after expansion, whether multiple expansion multiples are applied to the same seed audience, and what the specific expansion multiple is. By scheduling the customer demand that is input, the seed audience data is obtained, which makes it easier to expand automatically on the basis of the attributes shared by the seed audience and can fully meet different advertisers' differing needs for precision and coverage.
The DSP (Demand Side Platform) mentioned above is a demand-side platform: a central management and control platform responsible for receiving delivery demands, finding audience data and carrying out functions such as delivery bidding. Its main characteristic is precise targeting of the target audience. For example, when delivering an advertisement, the advertiser inputs the delivery demand on the DSP and describes the target audience to be reached, such as age, gender, occupation and interests. Fixed conditions can also be set, for example that the unit price of each ad click by a user on a PC must not exceed 2 fen. These conditions are then sent to the DSP system, which communicates with the DMP, finds the audience matching the conditions from the user feature data analyzed and extracted in the DMP system, and delivers the advertisement in real environments such as media resources.
After the seed audience data has been obtained, the seed audience must be analyzed along multiple dimensions to select its most representative common features; using these features together with the user feature data, another batch of users similar to the seed audience is selected from a large number of active users. Specifically, a computation model can be built using machine learning, which may be a classification model in supervised learning or a clustering model in unsupervised learning. The seed audience data and the user feature data are fed into the computation model for calculation and analysis to obtain the users similar to the seed audience. The range of users to whom advertisements are directed is thereby narrowed from broad feature data to more precise users, meeting advertisers' differing needs for precision and coverage and improving the efficiency of audience expansion.
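One simple way to turn positive and negative samples into a scoring rule can be sketched as follows. The specification does not name a particular model, so a naive-Bayes-style smoothed log-odds weighting is swapped in here purely as a stand-in; the function names, the smoothing constant alpha and the feature strings are all hypothetical.

```python
import math
from collections import Counter


def learn_feature_weights(positive, negative, alpha=1.0):
    """Weight each feature by the smoothed log-odds of appearing in a
    positive (seed) user versus a negative (training) user."""
    pos_counts = Counter(f for feats in positive for f in set(feats))
    neg_counts = Counter(f for feats in negative for f in set(feats))
    weights = {}
    for feature in set(pos_counts) | set(neg_counts):
        p = (pos_counts[feature] + alpha) / (len(positive) + 2 * alpha)
        q = (neg_counts[feature] + alpha) / (len(negative) + 2 * alpha)
        weights[feature] = math.log(p / q)
    return weights


def association_score(features, weights):
    """Sum the weights of the features a user has; unseen features count as 0."""
    return sum(weights.get(feature, 0.0) for feature in set(features))


seeds = [{"host:sports", "time:evening"}, {"host:sports"}]
others = [{"host:news"}, {"host:news", "time:evening"}, {"host:shop"}]
weights = learn_feature_weights(seeds, others)
sporty = association_score({"host:sports"}, weights)
newsy = association_score({"host:news"}, weights)
```

Features common among seed users receive positive weights and features common among the negative samples receive negative weights, so a higher score indicates a user closer to the seed audience.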
The above is in fact only a simple example of an embodiment of the present invention; the specific content of the invention is described one by one in the embodiments below.
Fig. 2 shows the method provided by an embodiment of the present invention for determining extended users according to a statistical degree of association, in which:
Step 201: obtain statistical data associated with the network behavior of all users in the data network, and perform feature extraction on the statistical data to determine the user features of all users.
The data network may be the general Internet or any of various dedicated networks. The network behavior data of all users in the data network needs to be obtained. A user's network behavior can be obtained from the operations the user performs while online, for example which websites the user logs in to, what content the user browses, or which videos the user watches. Users' network behavior is varied; it may be the content of a user's operations or the content of a user's behavior traces.
User network behavior can be obtained by recording it, or through various network software and hardware. In practice, obtaining user network behavior relies mostly on classifying and recording it on the basis of analysis.
After a user's network behavior has been recorded, statistical analysis is required, through which the behavior can be classified. User network behavior comprises many types of behavior and many kinds of content and needs to be stored by category. On the basis of the categorized storage, statistics are computed to obtain the statistical data. The statistical data contains all user network behavior as well as the various possible network behaviors derived by classifying and summarizing it. A user's network behavior may comprise: search click behavior, web browsing behavior and/or behavior obtained through third-party cooperation.
Analysis of user network behavior in turn allows user features to be extracted. User features are the various behavioral features a user exhibits when logged on to the network. They are the user's action features, comprising the user's operation behavior on the network and possible action behavior. User features characterize the details of each of the user's operations, so that the user's habits when logging on to the network can be determined and corresponding predictions made about the user's behavior.
User features may comprise the user's host features, n-gram features, online time periods, the region from which the user goes online and/or product browsing behavior. User features differ from user attributes. User attributes are generally fixed attributes of a user, including the user identifier, age, IP and other attribute information attached to the user. User features are action and behavior information generated while the user is online; they are the user's operational information in the network and are dynamic.
Thus, with a massive number of users, user attribute data grows with the number of users and is itself massive. User feature data, because it involves many behaviors of a massive number of users, is larger still than the user attribute data.
Step 202, receive an extension request to perform similar-user extension on basic users, and parse the extension request to determine the set quantity of extended users and a positive sample set including multiple basic users.
Basic users are the users that the client provides as seeds. The client can be an advertiser or any other client that needs user extension. Basic users are put forward by the client and set according to the client's needs. Generally, the multiple basic users are determined from the set of user attributes selected by the client and the number of basic users entered by the client. In other words, the client sets the basic users according to the user attributes it cares about.
After logging in to the above look-alike system, the client can select the key user attributes it cares about, and the system provides the client with a corresponding dynamic list of basic users, and their number, according to those attributes. As the client changes the combination of user attributes, the dynamic list of basic users is adjusted, until the quality and quantity of the basic users meet the client's needs.
The client also needs to set the scale of the required user extension, that is, the client sets an extension demand and the quantity of extended users is determined from it. For example, the client may obtain 1,000,000 basic users by adjusting the user attributes and then enter an overall extended-user scale of 10,000,000. In this case the scale of user extension is 10 times.
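The arithmetic of the example above can be written out as a short sketch. The function name and its parameters are hypothetical and only restate the ratio described in the text.

```python
def extension_scale(basic_user_count, extended_user_total):
    """Return how many times larger the extended crowd is than the seed crowd."""
    if basic_user_count <= 0:
        raise ValueError("at least one basic user is required")
    return extended_user_total / basic_user_count

# 1,000,000 basic users and an overall extended-user scale of 10,000,000.
scale = extension_scale(1_000_000, 10_000_000)
```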
After the basic users are generated, they can be placed in the positive sample set. The positive sample set consists of basic users; the overall user features that the client is looking for can be extracted from these basic users, which reveals the client's actual needs.
Step 203, determine, according to all users in the data network and the multiple basic users, a negative sample set including multiple training users, wherein the ratio of the number of basic users to the number of training users is less than or equal to a predetermined threshold.
After the positive sample set is obtained, the negative sample set must also be obtained. The negative sample set consists of users selected on the basis of the user features of all users in the data network. The users in the negative sample set are the set of users with the lowest correlation to the basic users in the positive sample set. That is, the users in the negative sample set are essentially unrelated to the users in the positive sample set.
To obtain the negative sample set, the user features of the basic users in the positive sample set are analyzed to find what they have in common, and the essential user features of the basic users are extracted from them. The user features of all users in the whole network are then compared against these essential features one by one to find the users whose features correlate least with them. These users are placed in the negative sample set.
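One way to picture the selection of the negative sample set is the sketch below. It assumes that each user is represented by a numeric feature vector, that the common features of the basic users are summarized by their centroid, and that correlation is measured by cosine similarity. None of these choices is stated in the text, and the function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def pick_negative_users(all_users, basic_ids, count):
    """Return the `count` non-basic users least similar to the basic-user centroid."""
    center = centroid([all_users[u] for u in basic_ids])
    candidates = [u for u in all_users if u not in basic_ids]
    candidates.sort(key=lambda u: cosine(all_users[u], center))
    return candidates[:count]

all_users = {
    "a": [1.0, 0.0], "b": [0.9, 0.1],                     # basic users
    "c": [0.8, 0.2], "d": [0.0, 1.0], "e": [0.1, 0.9],    # other users
}
negatives = pick_negative_users(all_users, {"a", "b"}, 2)
```

In this toy data, user `c` resembles the basic users and is left out, while `d` and `e` are chosen as training users for the negative sample set.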
The users in the negative sample set also need further processing to remove irrelevant data. For example, obviously abnormal user data needs to be filtered out.
To facilitate training of the subsequent computation model, the positive sample set data in the extracted user feature data can be labeled as positive-class sample data, and the negative sample set data can be labeled as negative-class sample data. Because of the nature of the data, the positive class contains at most millions of samples while the negative class contains hundreds of millions, so the ratio of positive to negative samples reaches 1:10 or even 1:100. Such an imbalance makes it hard for a machine learning model, a classification model in particular, to learn an effective model. The samples can therefore be processed as follows.
The positive-class sample data is over-sampled and the negative-class sample data is under-sampled (negative sampling). Specifically, the sampling rates for over-sampling and under-sampling can be adjusted according to the ratio of positive-class to negative-class sample data, and a workable ratio can be determined through several rounds of experiments. Preferably, before the positive-class and negative-class sample data are sampled, the acquired user feature data is also analyzed and dirty sample data is filtered out, so that the accuracy of the subsequent classification model is not affected.
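The rebalancing step can be sketched as follows. The text only says that sampling rates are tuned experimentally, so the function, its parameters `pos_repeat` and `neg_per_pos`, and the simple replication strategy for over-sampling are assumptions.

```python
import random

def rebalance(positives, negatives, pos_repeat=2, neg_per_pos=2, seed=0):
    """Over-sample positives by replication and under-sample negatives at random."""
    rng = random.Random(seed)
    # Over-sampling: repeat every positive sample `pos_repeat` times.
    pos = [p for p in positives for _ in range(pos_repeat)]
    # Under-sampling: keep at most `neg_per_pos` negatives per retained positive.
    keep = min(len(negatives), neg_per_pos * len(pos))
    neg = rng.sample(negatives, keep)
    return pos, neg

pos, neg = rebalance(list(range(3)), list(range(100, 200)))
```

Starting from 3 positives and 100 negatives (about 1:33), the defaults yield 6 positives and 12 negatives (1:2). In practice the two rates would be chosen by running several experiments, as the text describes.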
Step 204, perform feature analysis on the user features of the multiple training users in the negative sample set to determine a computation rule for calculating the association degree of each user.
A computation model is trained on the sampled data obtained through the over-sampling and under-sampling. After the positive sample set data is obtained, the user features are analyzed along multiple dimensions and the most representative common features are selected; using these features together with the user feature data, another group of users similar to the seed crowd is filtered out of the large pool of active users. Specifically, a computation model is selected first; it may include models such as logistic regression and/or a linear SVM (support vector machine). The sampled data obtained through the over-sampling and under-sampling is then used to train the computation model, yielding an effective computation model.
The user features of the multiple training users in the negative sample set and of the basic users in the positive sample set are extracted separately, the two are compared for correlation, and the computation rule is extracted.
Here the computation model is the computation rule; the complete computation rule is extracted by means of model training.
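As an illustration of extracting a computation rule by model training, the sketch below fits a logistic regression with plain stochastic gradient descent. Logistic regression is one of the models the text names; the pure-Python training loop, the learning rate, the epoch count and the toy feature vectors are assumptions, and a production system would more likely rely on an existing library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, lr=0.5, epochs=200):
    """Fit weights and bias with stochastic gradient descent on log loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def association_score(model, x):
    """Predicted probability of the positive class, used as the score."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Label 1: basic users (positive set); label 0: training users (negative set).
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
model = train_logistic(samples, labels)
```

The trained `model` plays the role of the computation rule: applying `association_score` to any user's feature vector gives that user's association degree score.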
Step 205, calculate the association degree score of each user among all users based on the computation rule, and sort all users in descending order of association degree score to generate a user list.
The computation rule obtained through the above model training can be used to run model prediction on all users of the whole network. Users are sorted by prediction score, and those whose score exceeds a certain threshold are taken as the extended crowd, that is, the similar target audience. The range of users targeted by advertising can thus be narrowed from users with broad characteristics to more precisely matched users, meeting advertisers' differing needs for precision and coverage and improving the efficiency of crowd extension.
According to the computation rule, the user features of the multiple users in the data network are compared and calculated one by one, and each user is assigned an association degree score relative to the basic users according to the result. The multiple users in the data network are sorted by association degree score, and the sorted result is adjusted according to user attributes.
The computation rule generates one association degree score for each user, and this score characterizes the user's degree of association with the basic users. Arranging all users by association degree score from largest to smallest yields the user list, which contains the order of all users and their respective association degree scores.
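Building the user list amounts to a descending sort on the scores. The following sketch assumes the scores are held in a dictionary keyed by user id; breaking ties by user id is an added assumption that keeps the order deterministic.

```python
def build_user_list(scores):
    """Return (user_id, score) pairs, highest score first; ties ordered by id."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

user_list = build_user_list({"u1": 0.42, "u2": 0.91, "u3": 0.42, "u4": 0.07})
```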
Step 206, determine as extended users the set quantity of users with the highest association degree scores in the user list from which the multiple basic users have been removed.
Once the users have been arranged by association degree score, the users with the higher scores can be selected as extended users. The exact quantity is determined by the client's setting and can be the extended-user scale set by the client.
Because all users of the whole network include the basic users initially selected by the client, and these basic users do not necessarily have high association degree scores, the client can choose whether the basic users need to be deleted from the final recommended list of extended users.
The basic users can be deleted before or after the association degree scores are calculated. Alternatively, they can be deleted before or after the extended users are recommended.
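The selection in step 206, including the client's choice of whether to drop the basic users, can be sketched as below. The function name, the `remove_basic` flag and the sample data are hypothetical; here the basic users are removed after scoring, which is one of the orderings the text permits.

```python
def select_extended_users(user_list, basic_ids, quantity, remove_basic=True):
    """Take the top `quantity` users from a list already sorted by score."""
    picked = []
    if quantity <= 0:
        return picked
    for user_id, _score in user_list:
        if remove_basic and user_id in basic_ids:
            continue  # skip seed users when the client asks for their removal
        picked.append(user_id)
        if len(picked) == quantity:
            break
    return picked

ranked = [("seed1", 0.99), ("u7", 0.95), ("seed2", 0.90), ("u3", 0.80), ("u9", 0.10)]
extended = select_extended_users(ranked, {"seed1", "seed2"}, 2)
```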
In this embodiment, based on the client demand that is input, a positive sample set of basic users and a negative sample set of users essentially unrelated to the basic users are obtained. A model is trained on the user features of the users in the positive and negative sample sets to obtain the computation rule; the association degree score between each user of the whole network and the users in the positive sample set is calculated one by one according to the computation rule; and the extended users are obtained according to the association degree scores. The client obtains audience data that matches its actual needs with high precision, so the client's differing needs can be fully met.
Fig. 3 shows a system, provided by the present invention, that determines extended users according to a statistical association degree. The system includes:
User feature unit 301, configured to obtain statistics associated with the network behavior of all users in a data network, and to perform feature extraction on the statistics to determine the user features of all users;
Positive sample set unit 302, configured to receive an extension request to perform similar-user extension on basic users, and to parse the extension request to determine the set quantity of extended users and a positive sample set including multiple basic users;
Negative sample set unit 303, configured to determine, according to all users in the data network and the multiple basic users, a negative sample set including multiple training users, wherein the ratio of the number of basic users to the number of training users is less than or equal to a predetermined threshold;
Computation rule unit 304, configured to perform feature analysis on the user features of the multiple training users in the negative sample set to determine a computation rule for calculating the association degree of each user;
Association degree calculation unit 305, configured to calculate the association degree score of each user among all users based on the computation rule, and to sort all users in descending order of association degree score to generate a user list;
Extended user unit 306, configured to determine as extended users the set quantity of users with the highest association degree scores in the user list from which the multiple basic users have been removed.
Preferably, the set quantity of users with the highest association degree scores in the user list from which the multiple basic users have not been removed is determined as the extended users.
Preferably, the user features of all users are extracted from statistics on the offline network behavior data of all users of the data network.
Preferably, the network behavior of the users includes: search click behavior, web page browsing behavior and/or behavior obtained through third-party cooperation.
Preferably, the user features include: the user's host features, n-gram features, online time periods, the region from which the user goes online and/or commodity browsing behavior.
Preferably, the positive sample set of basic users is determined according to the set of multiple user attributes selected by the client and the number of basic users entered by the client.
Preferably, the users among all users in the data network whose user features have lower association degree scores with respect to the multiple basic users are classified into the negative sample set including multiple training users.
Preferably, dirty sample data with obviously abnormal user features is filtered out of all users in the data network to obtain the negative sample set.
Preferably, negative sampling is performed on the user features of all users in the data network, and the number of training users in the negative sample set is obtained according to the predetermined threshold and the number of basic users.
Preferably, the user features of the multiple training users in the negative sample set and of the basic users in the positive sample set are extracted separately, the two are compared for correlation, and the computation rule is extracted.
Preferably, according to the computation rule, the user features of the multiple users in the data network are compared and calculated one by one, and each user is assigned an association degree score relative to the basic users according to the result.
Preferably, the multiple users in the data network are sorted by association degree score, and the sorted result is adjusted according to user attributes.
Preferably, negative sampling is performed on all users in the data network to obtain the negative sample set, and positive sampling is performed on the basic users to obtain the positive sample set; the sampling coefficients for the negative sampling and the positive sampling are set as needed.
Preferably, the sampling coefficients for the negative sampling and the positive sampling are set according to the number of basic users actually required in the positive sample set and the number of training users in the negative sample set.
As shown in Fig. 4, an embodiment of the present invention provides a method for determining extended users according to weighted calculation, wherein:
Step 401, obtain statistics associated with the network behavior of all users in a data network, and perform feature extraction on the statistics to determine the user features of all users.
The data network can be the general Internet or any of various dedicated networks. The network behavior data of all users in the data network needs to be obtained. A user's network behavior can be obtained from the various operations the user performs while online, for example which websites the user logs in to, which content the user browses, or which videos the user watches. User network behavior is varied: it can be the content of the user's operations or the content of the user's behavior traces.
User network behavior can be obtained by recording the users' network behavior, or through various network software and hardware. In practice, user network behavior is acquired mostly by classifying and recording it on the basis of analysis. After the network behavior of users has been recorded, statistical analysis is required, and through such analysis the behavior can be classified. User network behavior covers many types of behavior and many kinds of content, so it must be stored by category. The categorized records are then counted to obtain the statistics. The statistics cover all user network behavior as well as the various possible network behaviors derived by classifying and generalizing it. The network behavior of a user can include search click behavior, web page browsing behavior and/or behavior obtained through third-party cooperation.
Analysis of user network behavior makes it possible to extract user features. User features are the characteristics of the various behaviors a user leaves on the network. They are the user's action characteristics, covering both the operations the user has performed on the network and the actions the user may perform. User features capture the details of each user operation, so the user's habits when going online can be determined from them and the user's behavior can be predicted accordingly.
User features can include the user's host features, n-gram features, online time periods, the region from which the user goes online and/or commodity browsing behavior. User features differ from user attributes. User attributes are generally fixed properties of the user, such as the user identifier, age and IP address, that is, attribute information attached to the user itself. User features, by contrast, are information about the user's actions while online, that is, the user's operational information on the network, and they are dynamic.
Thus, with a massive number of users, user attribute data grows with the number of users and is itself massive. User feature data, because it covers many kinds of behavior for that massive number of users, is larger still than the user attribute data.
Step 402, receive an extension request to perform similar-user extension on basic users, and parse the extension request to determine the set quantity of extended users and the multiple basic users.
Basic users are the users that the client provides as seeds. The client can be an advertiser or any other client that needs user extension. Basic users are put forward by the client and set according to the client's needs. Generally, the multiple basic users are determined from the set of user attributes selected by the client and the number of basic users entered by the client. In other words, the client sets the basic users according to the user attributes it cares about.
After logging in to the above look-alike system, the client can select the key user attributes it cares about, and the system provides the client with a corresponding dynamic list of basic users, and their number, according to those attributes. As the client changes the combination of user attributes, the dynamic list of basic users is adjusted, until the quality and quantity of the basic users meet the client's needs.
The client also needs to set the scale of the required user extension, that is, the client sets an extension demand and the quantity of extended users is determined from it. For example, the client may obtain 1,000,000 basic users by adjusting the user attributes and then enter an overall extended-user scale of 10,000,000. In this case the scale of user extension is 10 times.
Step 403, determine a corresponding sample set for each of multiple preset training rules, and perform feature analysis on the user features in each sample set to determine a computation rule for calculating the association degree of each user.
The preset training rules can be of several kinds; common ones are supervised classification training on the user features, clustering training on the user features and/or semi-supervised training on the user features.
Supervised classification training uses existing training samples (that is, known data and its corresponding output) to train an optimal model. This model then maps every new data sample to a corresponding output, and a simple judgment on the output achieves the purpose of classification, so the optimal model is able to classify unknown data.
Unsupervised clustering training has no training data samples in advance and has to model the data directly. Usually a clustering algorithm is used to train cluster centers, and the cluster centers are then used as classification rules for supervised learning.
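The idea of training cluster centers and then using them as a classification rule can be sketched with a small k-means routine. The text does not name a clustering algorithm, so k-means, the deterministic initialization from the first `k` points, and the helper names are assumptions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def nearest_center(point, centers):
    """Classification rule: index of the closest cluster center."""
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))

def kmeans(points, k, iters=10):
    """Plain k-means; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
centers = kmeans(points, k=2)
label = nearest_center([4.8, 5.3], centers)
```

Once the centers are trained, `nearest_center` assigns any new user vector to a cluster without needing labeled training data.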
There are also semi-supervised training models, that is, optimized schemes that combine supervised and unsupervised training models. For example, the model may cluster on the basis of a classification, or may apply further human intervention to classify on the basis of a clustering.
The specific training rules can be selected according to actual needs. In this embodiment several different training rules are selected, and all users of the whole network are trained on separately under each training rule to determine the sample set corresponding to each rule. Each sample set is obtained through its own training rule, and the sample sets are independent of one another.
Further feature analysis of the user features in each sample set determines a computation rule for analyzing the association degree of all users. Likewise, each computation rule corresponds to a different training rule, and the computation rules are independent of one another.
In this embodiment, sample sets are trained separately under multiple training rules, and computation rules are then extracted from the sample sets. Computation rules are generally extracted by model training. The user features are analyzed along multiple dimensions and the most representative common features are selected; using these features together with the user feature data, another group of users similar to the seed crowd is filtered out of the large pool of active users. Specifically, a computation model is selected first; it may include models such as logistic regression and/or a linear SVM (support vector machine). The sample data in the above sample sets is then used to train the computation model, yielding an effective computation model.
The computation rule obtained through the above model training can be used to run model prediction on all users of the whole network. Users are sorted by prediction score, and those whose score exceeds a certain threshold are taken as the extended crowd, that is, the similar target audience. The range of users targeted by advertising can thus be narrowed from users with broad characteristics to more precisely matched users, meeting advertisers' differing needs for precision and coverage and improving the efficiency of crowd extension.
According to the computation rule, the user features of the multiple users in the data network are compared and calculated one by one, and each user is assigned an association degree score relative to the basic users according to the result. The multiple users in the data network are sorted by association degree score, and the sorted result is adjusted according to user attributes.
According to the multiple training rules, the corresponding sample sets are determined respectively; all users are analyzed according to the user features in each sample set to determine the association degree of each user, and the computation rule for association degree calculation is obtained.
According to the computation rules, the association degree of all users is calculated separately to obtain the association degree score of each user; the users are sorted by their association degree scores.
Step 404, calculate the association degree score of each user among all users based on each of the multiple computation rules, and sort all users in descending order of association degree score to generate multiple user lists.
Each of the computation rules can calculate its own set of association degree scores for all users of the whole network. For example, with three computation rules, three association degree scores are calculated for each user of the whole network. The users of the whole network are arranged separately according to each set of scores to obtain multiple user lists; that is, each computation rule corresponds to one user list, which contains the order of all users and their respective association degree scores.
Step 405, set a weight for each user list according to the accuracy of the corresponding training rule, weight the association degree score of each user according to the weight of each user list, and determine the output score of each user from the result of the weighted calculation.
The accuracy of each computation rule is calculated from the association degree scores it assigns to the users, and a weight is set for the computation rule according to that accuracy. Because different computation rules differ in accuracy, each user list is given a weight set according to the accuracy of its computation rule.
According to the weight of each computation rule, the association degree scores that the rule produces for the users are weighted.
Each user's association degree score in each of the multiple user lists is multiplied by the weight of that list, and the final output score is then calculated from the weighted values. Extended users are selected according to the final output score.
In this embodiment, the association degree scores of each user under the several computation rules are weighted and combined to calculate the final score. The weighted results obtained for a user under each computation rule are multiplied together or added together to give the user's output score. The users are sorted by output score, and the set quantity of users is output as the extended users.
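The weighted combination can be sketched as follows. The text allows the weighted results to be multiplied or added; this sketch uses addition. Normalizing the accuracies so that the weights sum to 1, the rule names and the sample scores are all assumptions.

```python
def weights_from_accuracy(accuracy):
    """Turn per-rule accuracies into weights that sum to 1 (assumed scheme)."""
    total = sum(accuracy.values())
    return {rule: acc / total for rule, acc in accuracy.items()}

def output_scores(score_lists, weights):
    """Weighted sum of each user's association degree scores across rules."""
    out = {}
    for rule, scores in score_lists.items():
        for user_id, score in scores.items():
            out[user_id] = out.get(user_id, 0.0) + weights[rule] * score
    return out

# Hypothetical accuracies and scores for three computation rules.
weights = weights_from_accuracy({"lr": 0.8, "svm": 0.6, "cluster": 0.6})
final = output_scores(
    {
        "lr": {"u1": 0.9, "u2": 0.2},
        "svm": {"u1": 0.8, "u2": 0.4},
        "cluster": {"u1": 0.5, "u2": 0.9},
    },
    weights,
)
```

With these numbers the weights are 0.4, 0.3 and 0.3, so user `u1` receives 0.4 × 0.9 + 0.3 × 0.8 + 0.3 × 0.5 = 0.75 and `u2` receives 0.47. Sorting `final` in descending order then gives the list from which the set quantity of extended users is taken.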
Once the users have been arranged by association degree score, the users with the higher scores can be selected as extended users. The exact quantity is determined by the client's setting and can be the extended-user scale set by the client.
The multiple users in the data network are sorted by output score, and the sorted result is adjusted according to user attributes. The set quantity of users with the highest output scores in the user list from which the multiple basic users have been removed are determined as the extended users.
Because all users of the whole network include the basic users initially selected by the client, and these basic users do not necessarily have high association degree scores, the client can choose whether the basic users need to be deleted from the final recommended list of extended users.
The basic users can be deleted before or after the association degree scores are calculated. Alternatively, they can be deleted before or after the extended users are recommended.
In this embodiment, multiple sample sets are trained under multiple preset training rules, and multiple computation rules are determined from them. All users are then evaluated under each computation rule to obtain each user's association degree score for that rule, and the weight of each computation rule is calculated. Combining each user's association degree score under each computation rule with the weight of that rule, the final output score of each user is obtained by weighted calculation, and the set quantity of extended users is determined according to the output scores. The client obtains audience data that matches its actual needs with high precision, so the client's differing needs can be fully met.
Fig. 5 shows a system, provided by an embodiment of the present invention, that determines extended users according to weighted calculation. The system includes:
User feature unit 501, configured to obtain statistics associated with the network behavior of all users in a data network, and to perform feature extraction on the statistics to determine the user features of all users;
Basic user unit 502, configured to receive an extension request to perform similar-user extension on basic users, and to parse the extension request to determine the set quantity of extended users and the multiple basic users;
Computation rule unit 503, configured to determine a corresponding sample set for each of multiple preset training rules, and to perform feature analysis on the user features in each sample set to determine a computation rule for calculating the association degree of each user;
Association degree score calculation unit 504, configured to calculate the association degree score of each user among all users based on each of the multiple computation rules, and to sort all users in descending order of association degree score to generate multiple user lists; and
Output score calculation unit 505, configured to set a weight for each user list according to the accuracy of the corresponding training rule, to weight the association degree score of each user according to the weight of each user list, and to determine the output score of each user from the result of the weighted calculation.
Preferably, the system is further configured to determine as extended users the set quantity of users with the highest association degree scores in the user list from which the multiple basic users have not been removed.
Preferably, the user features of all users are extracted from statistics on the offline network behavior data of all users of the data network.
Preferably, the network behavior of the users includes: search click behavior, web page browsing behavior and/or behavior obtained through third-party cooperation.
Preferably, the user features include: the user's host features, n-gram features, online time periods, the region from which the user goes online and/or commodity browsing behavior.
Preferably, the multiple basic users are determined according to the set of multiple user attributes selected by the client and the number of basic users entered by the client.
Preferably, the multiple training rules include supervised classification training on the user features, clustering training on the user features and/or semi-supervised training on the user features.
Preferably, according to the multiple training rules, the corresponding sample sets are determined respectively; all users are analyzed according to the user features in each sample set to determine the association degree of each user, and the computation rule for association degree calculation is obtained.
Preferably, according to the computation rules, the association degree of all users is calculated separately to obtain the association degree score of each user; the users are sorted by their association degree scores.
Preferably, the accuracy of the corresponding computation rule is calculated according to the association degree scores of the users, and a weight is set for the corresponding computation rule according to the accuracy.
Preferably, according to the weight of each computation rule, the association degree scores that the rule produces for the users are weighted.
Preferably, the weighted results obtained for a user under each computation rule are multiplied together or added together to give the user's output score.
Preferably, the users are sorted by output score, and the set quantity of users is output as the extended users.
Preferably, the multiple users in the data network are sorted by output score, and the sorted result is adjusted according to user attributes.
Preferably, the set quantity of users with the highest output scores in the user list from which the multiple basic users have been removed are determined as the extended users.
Fig. 6 shows a method for determining extended users according to statistical interest degree. The method includes:
Step 601, obtain statistics associated with the network behavior of all users in a data network, and perform feature extraction on the statistics to determine the user features of all users.
The data network may be the general Internet or any of various dedicated networks. The network behavior data of all users in the data network needs to be obtained. A user's network behavior can be derived from the user's various operations while online, for example which websites the user logs in to, which content the user browses, or which videos the user watches. User network behavior is varied; it may be the content of the user's operations or the content of the user's behavior traces.
User network behavior can be obtained by recording it directly, or through various network software and hardware. In practice, obtaining user network behavior relies largely on analyzing, classifying and recording that behavior. Once a user's network behavior has been recorded, statistical analysis is needed, and various statistical analyses can be used to classify the behavior. User network behavior covers many types of behavior and many kinds of content, and needs to be stored by category. The stored, categorized data is then counted to obtain the statistics. The statistics include all user network behaviors as well as the various possible network behaviors derived by classifying and generalizing them. The network behavior of a user may include: search-click behavior, web-browsing behavior and/or behavior obtained through third-party cooperation.
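The classify-then-count step described above can be illustrated with a minimal sketch. The record layout `(user_id, behavior_type, content)` and the behavior type names are assumptions made for illustration only; the source does not specify a log format.

```python
from collections import Counter, defaultdict

# Hypothetical raw behavior records: (user_id, behavior_type, content).
records = [
    ("u1", "search_click", "running shoes"),
    ("u1", "page_view", "shop.example.com"),
    ("u1", "page_view", "news.example.com"),
    ("u2", "page_view", "shop.example.com"),
]

def build_statistics(records):
    """Store records by category per user and count each behavior type."""
    stats = defaultdict(Counter)
    for user_id, behavior_type, _content in records:
        stats[user_id][behavior_type] += 1
    return stats

statistics = build_statistics(records)
```

A `Counter` per user keeps the categorized counts together, and a behavior type that a user never performed simply reads as zero.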
Analyzing user network behavior makes it possible to extract user features. User features are the various behavioral features that a user exhibits while logged in to the network. They are the user's action features, covering both the user's operation behavior on the network and possible action behavior. User features characterize every action detail of a user's operations, so that the user's behavioral habits when logging in to the network can be determined from them and the user's behavior predicted accordingly.
User features may include the user's host features, n-gram features, time periods of Internet access, region of Internet access and/or commodity-browsing behavior. User features differ from user attributes. User attributes are usually fixed properties of a user, such as the user identifier, age, IP and other attribute information attached to the user. User features, by contrast, are action-behavior information produced while the user is online; they are the user's operational information in the network and are dynamic.
Thus, in an environment with a massive number of users, user attributes scale with the number of users and are themselves massive. User features involve the many behaviors of those massive users, so their data volume is far larger still than that of user attribute data.
Step 602, receive an extension request for extending basic users with similar users, and parse the extension request to determine the set number of extended users and the multiple basic users.
Basic users are the users that a client provides as seeds. The client may be an advertiser or any other client that needs to extend its users. Basic users are proposed by the client and set according to the client's needs. In general, the multiple basic users are determined according to the set of multiple user attributes selected by the client and the number of basic users input by the client. That is, the client sets the basic users according to the user attributes it cares about.
After logging in to the look-alike system described above, the client can select the key user attributes it cares about, and the system provides the client with the corresponding dynamic list and number of basic users according to those attributes. As the client changes the combination of user attributes, the dynamic list of basic users is adjusted dynamically until the quality and number of basic users meet the client's needs.
The client also needs to set the required scale of user extension; that is, the client sets an extension demand, and the number of extended users is determined from it. For example, the client may obtain 1,000,000 basic users by adjusting the user attributes and then input an overall extended-user scale of 10,000,000. In this case, the scale of user extension is 10 times.
Step 603, perform feature analysis on the user features in the set sample sets to determine computation rules for calculating the association degree of each user, and calculate the initial association degree score of each user among all users based on each of the multiple computation rules.
Further feature analysis of the user features in each sample set determines the computation rule used to analyze the association degree of all users. Likewise, each computation rule corresponds to a different training rule, and the computation rules are independent of one another.
In this embodiment, the sample sets are obtained by training with multiple training rules, and the computation rules are then extracted from the sample sets. Computation rules are typically extracted by model training. The user features are analyzed along multiple dimensions and the most representative common features are selected; based on these features and the user feature data, another batch of users similar to the seed crowd is selected from a large number of active users. Specifically, a computation model is first selected; the computation model may include logistic regression and/or a linear SVM (support vector machine). The sampled data in the sample sets is used to train the computation model, yielding an effective computation model.
The computation rule obtained by this model training can be used to make model predictions for all users of the network. Based on the predictions, users whose predicted score exceeds a certain threshold are selected in ranked order as the extended crowd, i.e. the similar target audience. In this way the range of users that an advertisement targets is narrowed from a broad population to more precise users, which meets advertisers' differing demands for precision and coverage and improves the efficiency of crowd extension.
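The train, predict and threshold flow described above can be sketched with a small logistic regression written in plain Python. This is only an illustration under stated assumptions: the two-dimensional feature vectors, the labels (1 for seed users, 0 for other active users), the learning rate, the epoch count and the 0.5 threshold are all hypothetical and are not taken from the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, lr=1.0, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

def predict(weights, bias, x):
    """Predicted probability that a user resembles the seed users."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical sample set: label 1 = seed (basic) user, 0 = other active user.
samples = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(samples, labels)

# Score every user, then keep those above an assumed threshold, ranked by score.
all_users = {"u1": [0.85, 0.85], "u2": [0.15, 0.15], "u3": [0.95, 0.75]}
scores = {uid: predict(weights, bias, x) for uid, x in all_users.items()}
threshold = 0.5
extended = sorted((u for u, s in scores.items() if s > threshold),
                  key=lambda u: scores[u], reverse=True)
```

In practice a library implementation of logistic regression or a linear SVM would normally replace the hand-written training loop; the sketch only shows how a trained model turns into a ranked, thresholded crowd.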
According to the computation rule, the user features of the multiple users in the data network are compared and calculated one by one, and each user is assigned an association degree score with respect to the basic users according to the result. The multiple users in the data network are sorted according to their association degree scores, and the sorted result is adjusted according to user attributes.
A corresponding sample set is determined for each of the multiple training rules; all users are analyzed according to the user features in each sample set to determine the association degree of each user and to obtain the computation rule for calculating the association degree.
The association degree of every user is calculated according to the computation rule to obtain the association degree score of each user; the users are sorted according to their association degree scores.
There are many kinds of training rules; common ones are supervised classification training on the user features, clustering training on the user features and/or semi-supervised training on the user features.
Supervised classification training uses existing training samples (i.e. known data and its corresponding outputs) to train an optimal model. The model then maps every new data sample to a corresponding output, and a simple judgment on that output achieves classification; the optimal model thus acquires the ability to classify unknown data.
Unsupervised clustering training has no training samples in advance and must model the data directly. Typically, cluster centers are trained with a clustering algorithm, and supervised learning is then performed using the cluster centers as classification rules.
There are also semi-supervised training models, i.e. optimized schemes that combine supervised and unsupervised training models. For example, a model may cluster on top of a classification, or apply further manually guided classification on top of a clustering.
The specific training rules can be selected according to actual needs. In this embodiment, several different training rules are selected, and all users of the network are trained on separately under each training rule, thereby determining the sample set corresponding to each training rule. Each sample set is obtained through its own training rule, and the sample sets are independent of one another.
The association degree score of each user among all users is calculated based on each of the multiple computation rules, and all users are sorted in descending order of association degree score to generate multiple user lists.
The multiple computation rules each calculate a full set of association degree scores for all users of the network. For example, with three computation rules, three association degree scores are calculated for each user of the network. The users of the network are sorted separately by each of these scores to obtain multiple user lists; that is, each computation rule corresponds to one user list that contains all users in sorted order together with their association degree scores.
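As a minimal sketch of the "one user list per computation rule" idea, the following example sorts the same three users under three rules. The rule names and the score values are invented for illustration.

```python
# Hypothetical association degree scores produced by three computation rules.
scores_by_rule = {
    "classification": {"u1": 0.9, "u2": 0.4, "u3": 0.7},
    "clustering": {"u1": 0.6, "u2": 0.8, "u3": 0.5},
    "semi_supervised": {"u1": 0.8, "u2": 0.3, "u3": 0.9},
}

def rank_users(scores):
    """Return (user, score) pairs sorted by descending score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# One ranked user list per computation rule.
user_lists = {rule: rank_users(scores) for rule, scores in scores_by_rule.items()}
```

Each list contains every user exactly once, so the lists differ only in order and in the score attached to each user.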
A weight value is set for each user list according to the accuracy of the corresponding training rule, and the association degree scores of each user are weighted according to the weight values of the user lists, so that the initial association degree score of each user is determined from the result of the weighted calculation.
The association degree of every user is calculated according to the computation rule to obtain the initial association degree score of each user; the users are sorted according to their initial association degree scores.
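The accuracy-based weighting can be sketched as follows. Two choices here are assumptions rather than statements from the source: the weights are taken to be the accuracies normalized to sum to 1, and the weighted per-rule scores are combined by addition (the text allows multiplication as well). All names and numbers are hypothetical.

```python
def normalize(accuracy_by_rule):
    """Turn per-rule accuracies into weights that sum to 1 (an assumed scheme)."""
    total = sum(accuracy_by_rule.values())
    return {rule: acc / total for rule, acc in accuracy_by_rule.items()}

def initial_scores(scores_by_rule, accuracy_by_rule):
    """Weighted sum of each user's per-rule association degree scores."""
    weights = normalize(accuracy_by_rule)
    users = next(iter(scores_by_rule.values())).keys()
    return {
        user: sum(weights[rule] * scores_by_rule[rule][user] for rule in scores_by_rule)
        for user in users
    }

# Hypothetical inputs: two computation rules and two users.
scores_by_rule = {
    "rule_a": {"u1": 1.0, "u2": 0.0},
    "rule_b": {"u1": 0.0, "u2": 1.0},
}
accuracy_by_rule = {"rule_a": 0.9, "rule_b": 0.6}

initial = initial_scores(scores_by_rule, accuracy_by_rule)
ranking = sorted(initial, key=initial.get, reverse=True)
```

Because `rule_a` is the more accurate rule in this toy input, the user it favors ends up ranked first.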
Step 604, perform interest degree extraction on the statistics to determine the interest degree score of each user, and adjust the initial association degree score based on the interest degree score to generate a corrected association degree score.
Each user also has distinctive interests while surfing the Internet. The corresponding interest degrees can be extracted from the statistics of the massive users, and the interest degree score of each user determined from them. The interest degree score characterizes each user's degree of interest in a particular user feature. For example, the degree of interest in browsing advertisements, or in a particular advertisement, differs from user to user: some users have high interest degree scores and others have low ones.
User features are extracted from the statistics of the users' network behavior, and the users' interest degrees are extracted from the statistics of the user features. The corresponding interest degree score is calculated from the interest degree statistics of the user; a user has different interest degree scores for different interest degrees.
The multiple basic users are obtained according to the similar-user information input by the client, the relevant interest degree is determined from the basic users, and interest degree extraction is performed on all users for that interest degree to calculate the corresponding interest degree scores.
The initial association degree score of a user can be corrected with this interest degree score to obtain the corrected association degree score. The interest degree score is multiplied by or added to the initial association degree score to generate the corrected association degree score. For example, if a user's interest degree score for browsing advertisements is zero because the user blocks all advertisements, then however high that user's initial association degree score is, the corrected association degree score after weighting by the interest degree score is zero.
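The multiplicative variant of the correction, including the ad-blocking example, can be sketched as below. The user identifiers and score values are hypothetical; the additive variant mentioned in the text would replace the product with a sum.

```python
def corrected_scores(initial, interest):
    """Multiply each initial association degree score by the user's interest degree score."""
    return {user: initial[user] * interest[user] for user in initial}

# Hypothetical values: u1 blocks all advertisements, so its interest degree score is zero.
initial = {"u1": 0.9, "u2": 0.6}
interest = {"u1": 0.0, "u2": 0.5}

corrected = corrected_scores(initial, interest)
```

With multiplication, a zero interest degree score removes a user from contention regardless of the initial score, which is the behavior the example in the text describes.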
Step 605, sort all users in descending order of corrected association degree score to generate a user list, and determine as the extended users the set number of users with the highest corrected association degree scores in the user list from which the multiple basic users have been removed.
The corrected association degree scores are arranged in descending order, and the set number of users with the highest corrected association degree scores are selected in that order as the extended users. That is, after all users in the data network are sorted by corrected association degree score, the set number of users with the highest scores are determined as the extended users.
Once the users have been ranked by association degree score, the users with the higher corrected association degree scores can be selected as the extended users. The specific number is determined by the client's setting and may be the extended-user scale set by the client.
The multiple users in the data network are sorted according to their corrected association degree scores, and the sorted result is adjusted according to user attributes. The set number of users with the highest output scores in the user list from which the multiple basic users have been removed are determined as the extended users.
Because the users of the network include the basic users initially selected by the client, and these basic users do not necessarily have high corrected association degree scores, whether the basic users need to be deleted from the final recommended list of extended users can be decided according to the client's choice.
The basic users may be deleted before or after the association degree scores are calculated, or alternatively before or after the extended users are recommended.
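A minimal sketch of the final selection follows, with removal of the basic users made optional to reflect the client's choice. The function name, the `remove_basic` flag and the data are assumptions for illustration; here the basic users are removed before sorting, which is one of the orderings the text permits.

```python
def select_extended_users(corrected, basic_users, n, remove_basic=True):
    """Pick the n users with the highest corrected association degree scores."""
    candidates = (
        (user, score)
        for user, score in corrected.items()
        if not (remove_basic and user in basic_users)
    )
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [user for user, _ in ranked[:n]]

# Hypothetical corrected scores; u1 is one of the client's basic (seed) users.
corrected = {"u1": 0.9, "u2": 0.8, "u3": 0.7, "u4": 0.1}
basic_users = {"u1"}

without_seeds = select_extended_users(corrected, basic_users, n=2)
with_seeds = select_extended_users(corrected, basic_users, n=2, remove_basic=False)
```

Filtering before sorting and filtering the already sorted list give the same result; only the amount of sorting work differs.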
In this embodiment, multiple computation rules are determined from multiple preset training rules and sample sets. All users are evaluated under the multiple computation rules to obtain each user's association degree score for each computation rule; the weight value of each computation rule is then calculated, and each user's initial association degree score is obtained by weighting that user's per-rule association degree scores with the weight values of the corresponding computation rules. The final corrected association degree score is then obtained by weighting with the user's interest degree score, and the set number of extended users is determined according to the corrected association degree scores. The client obtains audience data that matches its actual needs with high precision, so that the differing needs of clients can be fully met.
Fig. 7 shows a system for determining extended users according to statistical interest degree provided by an embodiment of the present invention. The system includes:
User feature unit 701, configured to obtain statistics associated with the network behavior of all users in a data network, and to perform feature extraction on the statistics to determine the user features of all users;
Basic user unit 702, configured to receive an extension request for extending basic users with similar users, and to parse the extension request to determine the set number of extended users and the multiple basic users;
Initial association degree calculation unit 703, configured to perform feature analysis on the user features in the set sample sets to determine computation rules for calculating the association degree of each user, and to calculate the initial association degree score of each user among all users based on each of the multiple computation rules;
Corrected association degree calculation unit 704, configured to perform interest degree extraction on the statistics to determine the interest degree score of each user, and to adjust the initial association degree score based on the interest degree score to generate a corrected association degree score;
Extended user unit 705, configured to sort all users in descending order of corrected association degree score to generate a user list, and to determine as the extended users the set number of users with the highest corrected association degree scores in the user list from which the multiple basic users have been removed.
Preferably, user features are extracted from the statistics of the users' network behavior, and the users' interest degrees are extracted from the statistics of the user features.
Preferably, the corresponding interest degree score is calculated from the interest degree statistics of the user; a user has different interest degree scores for different interest degrees.
Preferably, the multiple basic users are obtained according to the similar-user information input by the client, the relevant interest degree is determined from the basic users, and interest degree extraction is performed on all users for that interest degree to calculate the corresponding interest degree scores.
Preferably, the system further includes: determining, as the extended users, the set number of users with the highest association degree scores in the user list from which the multiple basic users have not been removed.
Preferably, the user features of all users are extracted from statistics on the offline network behavior data of all users of the data network.
Preferably, the network behavior of the user includes: search-click behavior, web-browsing behavior and/or behavior obtained through third-party cooperation.
Preferably, the user features include: the user's host features, n-gram features, time periods of Internet access, region of Internet access and/or commodity-browsing behavior.
Preferably, the multiple basic users are determined according to the set of multiple user attributes selected by the client and the number of basic users input by the client.
Preferably, the multiple training rules include supervised classification training on the user features, clustering training on the user features and/or semi-supervised training on the user features.
Preferably, a corresponding sample set is determined for each of the multiple training rules; all users are analyzed according to the user features in each sample set to determine the association degree of each user and to obtain the computation rule for calculating the association degree.
Preferably, the association degree of every user is calculated according to the computation rule to obtain the initial association degree score of each user; the users are sorted according to their initial association degree scores.
Preferably, the interest degree score is multiplied by or added to the initial association degree score to generate the corrected association degree score.
Preferably, the corrected association degree scores are arranged in descending order, and the set number of users with the highest corrected association degree scores are selected in that order as the extended users.
Preferably, after all users in the data network are sorted by corrected association degree score, the set number of users with the highest corrected association degree scores are determined as the extended users.
Fig. 8 shows a method for distributed encoding of user features provided by an embodiment of the present invention. The method includes:
Step 801, obtain statistics associated with the network behavior of all users in a data network, and perform feature extraction on the statistics to determine multiple user features and the total number of the multiple user features.
Each of the embodiments above has to process massive numbers of user features. Because of hardware limitations, the amount of computation on massive user feature data is too large to complete on a particular computer; the user features therefore need to be encoded in a distributed manner before the relevant calculation and processing are performed.
The data network may be the general Internet or any of various dedicated networks. The network behavior data of all users in the data network needs to be obtained. A user's network behavior can be derived from the user's various operations while online, for example which websites the user logs in to, which content the user browses, or which videos the user watches. User network behavior is varied; it may be the content of the user's operations or the content of the user's behavior traces.
User network behavior can be obtained by recording it directly, or through various network software and hardware. In practice, obtaining user network behavior relies largely on analyzing, classifying and recording that behavior. Once a user's network behavior has been recorded, statistical analysis is needed, and various statistical analyses can be used to classify the behavior. User network behavior covers many types of behavior and many kinds of content, and needs to be stored by category. The stored, categorized data is then counted to obtain the statistics. The statistics include all user network behaviors as well as the various possible network behaviors derived by classifying and generalizing them. The network behavior of a user may include: search-click behavior, web-browsing behavior and/or behavior obtained through third-party cooperation.
Analyzing user network behavior makes it possible to extract user features. User features are the various behavioral features that a user exhibits while logged in to the network. They are the user's action features, covering both the user's operation behavior on the network and possible action behavior. User features characterize every action detail of a user's operations, so that the user's behavioral habits when logging in to the network can be determined from them and the user's behavior predicted accordingly.
User features may include the user's host features, n-gram features, time periods of Internet access, region of Internet access and/or commodity-browsing behavior. User features differ from user attributes. User attributes are usually fixed properties of a user, such as the user identifier, age, IP and other attribute information attached to the user. User features, by contrast, are action-behavior information produced while the user is online; they are the user's operational information in the network and are dynamic.
Thus, in an environment with a massive number of users, user attributes scale with the number of users and are themselves massive. User features involve the many behaviors of those massive users, so their data volume is far larger still than that of user attribute data.
Within the massive user feature data, the total number of user features must first be determined. This total varies and is adjusted according to the client's demand, so the total number of user features has to be determined anew for each user demand.
In practice, the client submits a task demand after accessing the look-alike system. According to the task demand, the system retrieves the required user features and the number of user features to be processed.
The multiple user features are classified and filtered according to the user demand, and the number of user features that meet the user demand is counted as the total number of user features.
Step 802, build a feature file that includes the multiple user features, and divide the feature file into multiple subfiles based on a preset division rule.
A user feature file must first be built for all the user features; it contains all the user feature data for the current task demand. All the user features are then divided into multiple subfiles according to the preset division rule. The number of user features in each subfile is roughly equal. Alternatively, the number of user features in each subfile can be determined according to the load capacity of the specific processing device.
In general, dividing the multiple user features into different subfiles requires a clustering algorithm. The feature file of the multiple user features is divided, by a division rule based on a hash function, into multiple subfiles corresponding to the number of processing nodes. Each user feature is assigned to one of the buckets according to the hash function.
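A hash-based division rule of the kind described above might look like the following sketch. The choice of SHA-256, the string form of the features and the bucket count are assumptions; the source only states that a hash function assigns each feature to a bucket. A cryptographic digest is used here instead of Python's built-in `hash` because the built-in is randomized per process for strings, which would send the same feature to different buckets on different nodes.

```python
import hashlib

def bucket_of(feature, n_buckets):
    """Map a feature string to a bucket number in a process-independent way."""
    digest = hashlib.sha256(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def partition(features, n_buckets):
    """Split the feature file content into n_buckets subfiles (lists of features)."""
    buckets = [[] for _ in range(n_buckets)]
    for feature in features:
        buckets[bucket_of(feature, n_buckets)].append(feature)
    return buckets

# Hypothetical user features and an assumed number of processing nodes.
features = ["host:shop.example.com", "ngram:sho", "hour:21", "region:east", "item:shoes"]
subfiles = partition(features, n_buckets=3)
```

A well-mixed hash tends to give buckets of roughly equal size on large inputs, which matches the "roughly equal" subfile sizes mentioned in the text, although small inputs like this one can be uneven.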
One such clustering algorithm can follow the scheme below:
Divide all user features to be encoded into N buckets;
Calculate the number Array[i] of user features in each bucket, where i is the number of the bucket, i = 0, 1, 2, 3, ..., N-1;
Convert Array[i] into the cumulative sum AccumulatedArray[i], where AccumulatedArray[i] = AccumulatedArray[i-1] + Array[i];
Encode the user features in each bucket starting from start_index[i] + 1, where start_index[i] = AccumulatedArray[i-1].
The method may also include: analyzing the user features to be encoded and filtering out dirty sample data, so as not to affect the accuracy of the subsequent classification model.
N can be set according to the scale of the user features and the scale of the compute nodes, and all user features are divided into N buckets; optionally, a hash function can be used to assign each user feature to one of the buckets. Once the bucket of every user feature has been calculated, the number of user features in each bucket is counted. These counts can be recorded in an array denoted Array[i], where i is the number of the bucket, i = 0, 1, 2, 3, ..., N-1; Array[0] is the number of user features in bucket 0, and Array[i-1] is the number of user features in bucket i-1.
The Array[i] calculated above is converted into the cumulative sum AccumulatedArray[i] by the formula AccumulatedArray[i] = AccumulatedArray[i-1] + Array[i], where AccumulatedArray[0] = Array[0]. The user features in each bucket are then encoded starting from start_index[i] + 1, where start_index[i] = AccumulatedArray[i-1].
N compute nodes can be called, each encoding the user features in one bucket. Specifically, to process bucket i: if i = 0, set start_index[0] = 0; otherwise set start_index[i] = AccumulatedArray[i-1]. The elements in the bucket are then encoded starting from start_index[i] + 1. This process completes the distributed encoding of user features and guarantees that the codes of different buckets do not conflict.
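The cumulative-sum scheme can be sketched as follows. The names `Array`, `AccumulatedArray` and `start_index` come from the text; the function names, the sample features and the sequential loop standing in for N separate compute nodes are assumptions for illustration.

```python
from itertools import accumulate

def start_indices(buckets):
    """Compute start_index[i] for every bucket from the per-bucket counts."""
    counts = [len(bucket) for bucket in buckets]   # Array[i]
    accumulated = list(accumulate(counts))         # AccumulatedArray[i]
    return [0] + accumulated[:-1]                  # start_index[i]

def encode_bucket(bucket, start_index):
    """Work of one compute node: assign codes start_index + 1, start_index + 2, ..."""
    return {feature: start_index + offset
            for offset, feature in enumerate(bucket, start=1)}

# Hypothetical buckets; the middle bucket is empty on purpose.
buckets = [["host:a", "host:b"], [], ["ngram:xy", "hour:21", "region:east"]]
starts = start_indices(buckets)

# Each call below could run on a different node, since it only needs its own
# bucket and its own start index.
codes = {}
for bucket, start in zip(buckets, starts):
    codes.update(encode_bucket(bucket, start))
```

Because each bucket's code range begins right after the cumulative count of the preceding buckets, the ranges are contiguous and disjoint, so no coordination between nodes is needed while encoding.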
Step 803, scan the content of the user features in each subfile to determine the number of user features in each subfile.
Each subfile contains a certain number of user features. To encode accurately, the content of the user features in each subfile is scanned to determine the number of user features in that subfile.
In addition, the subfiles have an associated order among them, which is established through encoding. The number of subfiles can be determined according to the number of processing units that actually handle the task. Obviously abnormal dirty sample data in the user features also needs to be filtered out.
Step 804, determine the encoding subspace of the user feature subset in each subfile based on the encoding space of the user features, the total number of the multiple user features and the number of user features in each subfile.
The encoding space of the user features is determined by the maximum number of codes that the encoding method allows. The encoding subspace of each subfile is determined according to the processing capacity of each processing node. The encoding subspace of each subfile is thus in effect determined for the user feature subset of that subfile.
Step 805, according to processing rule set in advance, each subfile and corresponding coding subspace are sent to more
In a processing node accordingly processing node with by it is corresponding processing node to the user characteristics in user characteristics subset into
Row coding.
Processing rule set in advance, as combines the disposal ability of each processing node, select the subfile that adapts into
Row processing.Saved when being encoded to user characteristics, it is necessary to which the corresponding subfile of corresponding character subset is sent corresponding processing
Point carries out.
In this embodiment, the feature file formed from all user features is divided into multiple subfiles corresponding to the processing nodes; the number of user features in each subfile and the number of each subfile are obtained; and each subfile is sent to its corresponding processing node for processing.
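The division of the feature file into per-node subfiles and the per-subfile count can be sketched as follows. This is a minimal sketch under stated assumptions: features are represented as strings, the names `bucket_of`, `split_into_subfiles`, and `feature_counts` are hypothetical, and MD5 is used only as a stable hash (Python's built-in `hash()` for strings is salted per process, so it would not give the same partition on every node).

```python
import hashlib
from typing import Dict, Iterable, List


def bucket_of(feature: str, num_nodes: int) -> int:
    """Map a feature to a subfile number in 0..num_nodes-1 with a stable hash."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def split_into_subfiles(features: Iterable[str], num_nodes: int) -> Dict[int, List[str]]:
    """Divide the feature file into one subfile (here: a list) per processing node."""
    subfiles: Dict[int, List[str]] = {i: [] for i in range(num_nodes)}
    for feature in features:
        subfiles[bucket_of(feature, num_nodes)].append(feature)
    return subfiles


def feature_counts(subfiles: Dict[int, List[str]]) -> Dict[int, int]:
    """Content scan: number of user features in each subfile."""
    return {number: len(items) for number, items in subfiles.items()}
```

Because the hash depends only on the feature itself, identical features always land in the same subfile, so each node can handle its own features without consulting the others.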
Fig. 9 shows a system for distributed coding of user features provided by an embodiment of the present invention, comprising:
a user feature unit 901, for obtaining statistics associated with the network behavior of all users in a data network, and performing feature extraction on the statistics to determine multiple user features and the total number of the multiple user features;
a feature file construction unit 902, for building a feature file that includes the multiple user features, and dividing the feature file into multiple subfiles based on a preset division rule;
a feature quantity confirmation unit 903, for scanning the content of the user features in each subfile to determine the number of user features in each subfile;
a coding subspace confirmation unit 904, for determining the coding subspace of the user-feature subset in each subfile, based on the coding space of the user features, the total number of the multiple user features, and the number of user features in each subfile;
a processing node allocation unit 905, for sending, according to a preset processing rule, each subfile and its corresponding coding subspace to the corresponding processing node among multiple processing nodes, so that the node encodes the user features in the user-feature subset.
Preferably, the statistics associated with the network behavior of all users in the data network are obtained from the users' search-click behavior, web-browsing behavior, and/or behavior obtained through third-party cooperation.
Preferably, feature extraction is performed on the statistics according to the user's host features, n-gram features, online time periods, the region from which the user goes online, and/or commodity-browsing behavior, to determine the multiple user features.
Preferably, the multiple user features are classified and pruned according to user demand, and the number of user features that meet the demand is counted as the total number of user features.
Preferably, the feature file of the multiple user features is divided, by a division rule based on a hash function, into multiple subfiles corresponding to the number of processing nodes.
Preferably, all user features are divided into N buckets;
the number Array[i] of user features in each bucket is calculated, where i is the bucket number, i = 0, 1, 2, 3, ..., N - 1;
Array[i] is converted into the cumulative sum AccumulatedArray[i], where AccumulatedArray[i] = AccumulatedArray[i-1] + Array[i];
the user features in each bucket are encoded starting from start_index[i] + 1, where start_index[i] = AccumulatedArray[i-1].
Preferably, each user feature is assigned to one of the buckets according to a hash function.
Preferably, the number of processing nodes equals the number of subfiles, and each processing node encodes the user features in one bucket.
Preferably, when i = 0, AccumulatedArray[0] = Array[0].
Preferably, when i = 0, start_index[0] = 0.
Preferably, obviously abnormal dirty sample data in the user features is filtered out.
Preferably, the coding space of the user features is determined by the maximum number of codes that the coding method allows.
Preferably, the coding subspace of each subfile is determined according to the processing capability of each processing node.
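The bucket numbering described above (Array, AccumulatedArray, start_index) amounts to a prefix sum over the bucket sizes. A minimal Python sketch follows; the function names `start_indices` and `encode_bucket` are hypothetical, and it assumes that the features inside a bucket are already deduplicated and held in a fixed order.

```python
from itertools import accumulate
from typing import Dict, List


def start_indices(counts: List[int]) -> List[int]:
    """counts[i] is Array[i]; returns start_index[i] for every bucket."""
    accumulated = list(accumulate(counts))  # AccumulatedArray[i]
    starts = [0] * len(counts)              # start_index[0] = 0
    for i in range(1, len(counts)):
        starts[i] = accumulated[i - 1]      # start_index[i] = AccumulatedArray[i-1]
    return starts


def encode_bucket(features: List[str], start_index: int) -> Dict[str, int]:
    """Number the features of one bucket, beginning at start_index + 1."""
    return {f: start_index + k for k, f in enumerate(features, start=1)}


# Three buckets holding 3, 1 and 2 features respectively.
buckets = [["a", "b", "c"], ["d"], ["e", "f"]]
starts = start_indices([len(b) for b in buckets])  # [0, 3, 4]
codes: Dict[str, int] = {}
for bucket, start in zip(buckets, starts):
    # Each call needs only its own bucket and start index, so in a
    # distributed setting every node could run it independently.
    codes.update(encode_bucket(bucket, start))
```

Since each bucket's range begins right after the previous bucket's cumulative total, the codes are globally unique and contiguous (1 to 6 in this example) without any coordination between nodes during encoding.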
Further, this embodiment provides a mobile terminal that includes, or is used to execute, the system of any of the above embodiments.
Numerous specific details are set forth in the specification provided here. It is to be understood, however, that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, to streamline the disclosure and aid understanding of one or more of the inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules of the device in an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into a single module, unit, or component, and can also be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) can be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described here include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention can be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. The invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention can be stored on a computer-readable medium or can take the form of one or more signals. Such a signal can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware.
The above is only an embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements, modifications, and variations without departing from the spirit of the invention, and these improvements, modifications, and variations are regarded as falling within the protection scope of this application.
Claims (10)
1. A method for determining extended users according to a statistical degree of association, the method comprising:
obtaining statistics associated with the network behavior of all users in a data network, and performing feature extraction on the statistics to determine the user features of all users;
receiving an extension request for performing similar-user extension on basic users, and parsing the extension request to determine a set number of extended users and a positive sample set comprising multiple basic users;
determining, according to all users in the data network and the multiple basic users, a negative sample set comprising multiple training users, wherein the ratio of the number of basic users to the number of training users is less than or equal to a predetermined threshold;
performing feature analysis on the user features of the multiple training users in the negative sample set to determine a calculation rule for calculating the degree of association of each user;
calculating a degree-of-association score for each of all users based on the calculation rule, and sorting all users in descending order of the degree-of-association score to generate a user list; and
determining, as extended users, the set number of users with the highest degree-of-association scores in the user list from which the multiple basic users have been removed.
2. The method of claim 1, further comprising: determining, as extended users, the set number of users with the highest degree-of-association scores in the user list from which the multiple basic users have not been removed.
3. The method of claim 1, wherein the user features of all users are extracted according to statistics of offline network-behavior data of all users in the data network.
4. The method of claim 3, wherein the network behavior of the users comprises: search-click behavior, web-browsing behavior, and/or behavior obtained through third-party cooperation.
5. The method of claim 3 or 4, wherein the user features comprise: the user's host features, n-gram features, online time periods, the region from which the user goes online, and/or commodity-browsing behavior.
6. A system for determining extended users according to a statistical degree of association, the system comprising:
a user feature unit, for obtaining statistics associated with the network behavior of all users in a data network, and performing feature extraction on the statistics to determine the user features of all users;
a positive sample set unit, for receiving an extension request for performing similar-user extension on basic users, and parsing the extension request to determine a set number of extended users and a positive sample set comprising multiple basic users;
a negative sample set unit, for determining, according to all users in the data network and the multiple basic users, a negative sample set comprising multiple training users, wherein the ratio of the number of basic users to the number of training users is less than or equal to a predetermined threshold;
a calculation rule unit, for performing feature analysis on the user features of the multiple training users in the negative sample set to determine a calculation rule for calculating the degree of association of each user;
a degree-of-association calculation unit, for calculating a degree-of-association score for each of all users based on the calculation rule, and sorting all users in descending order of the degree-of-association score to generate a user list; and
an extended user unit, for determining, as extended users, the set number of users with the highest degree-of-association scores in the user list from which the multiple basic users have been removed.
7. The system of claim 6, the system being further used for: determining, as extended users, the set number of users with the highest degree-of-association scores in the user list from which the multiple basic users have not been removed.
8. The system of claim 6, wherein the user features of all users are extracted according to statistics of offline network-behavior data of all users in the data network.
9. The system of claim 8, wherein the network behavior of the users comprises: search-click behavior, web-browsing behavior, and/or behavior obtained through third-party cooperation.
10. The system of claim 8 or 9, wherein the user features comprise: the user's host features, n-gram features, online time periods, the region from which the user goes online, and/or commodity-browsing behavior.
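The sampling and selection steps recited in claims 1 and 2 can be illustrated with a short Python sketch. It rests on several assumptions: users are identified by strings, the degree-of-association scores are supplied as a ready-made mapping (the calculation rule itself is not specified here), training users are drawn at random from users who are not basic users, and the names `sample_negatives` and `select_extended` are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Set


def sample_negatives(all_users: Sequence[str], basic_users: Set[str],
                     max_ratio: float, rng: random.Random) -> List[str]:
    """Pick training users so that len(basic_users) / len(result) <= max_ratio."""
    # Assumption: basic users are kept out of the negative sample set.
    candidates = [u for u in all_users if u not in basic_users]
    needed = math.ceil(len(basic_users) / max_ratio)
    if needed > len(candidates):
        raise ValueError("not enough non-basic users to satisfy the ratio")
    return rng.sample(candidates, needed)


def select_extended(scores: Dict[str, float], basic_users: Set[str],
                    set_number: int, exclude_basic: bool = True) -> List[str]:
    """Sort users by score, descending, and keep the top `set_number`.

    exclude_basic=True follows claim 1 (basic users removed from the list);
    exclude_basic=False follows claim 2 (basic users left in the list).
    """
    ranked = sorted(scores, key=lambda user: scores[user], reverse=True)
    if exclude_basic:
        ranked = [u for u in ranked if u not in basic_users]
    return ranked[:set_number]
```

With scores `{"u1": 0.9, "u2": 0.8, "u3": 0.95, "u4": 0.1}`, basic user `u3`, and a set number of 2, the claim 1 variant returns `["u1", "u2"]`, while the claim 2 variant returns `["u3", "u1"]`.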
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711446826.0A CN108038739A (en) | 2017-12-27 | 2017-12-27 | A kind of method and system that extending user is determined according to the statistics degree of association |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108038739A true CN108038739A (en) | 2018-05-15 |
Family
ID=62097527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711446826.0A Pending CN108038739A (en) | 2017-12-27 | 2017-12-27 | A kind of method and system that extending user is determined according to the statistics degree of association |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038739A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120047014A1 (en) * | 2010-08-23 | 2012-02-23 | Yahoo! Inc. | Method and system for using email receipts for targeted advertising |
US20150339754A1 (en) * | 2014-05-22 | 2015-11-26 | Craig J. Bloem | Systems and methods for customizing search results and recommendations |
CN105447730A (en) * | 2015-12-25 | 2016-03-30 | 腾讯科技(深圳)有限公司 | Target user orientation method and device |
CN105931079A (en) * | 2016-04-29 | 2016-09-07 | 合网络技术(北京)有限公司 | Method and apparatus for diffusing seed users |
Non-Patent Citations (1)
Title |
---|
李友诚,安月英主编: "《数字图书馆研究》", 30 September 2008, 西安地图出版社 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409419A (en) * | 2018-09-30 | 2019-03-01 | 北京字节跳动网络技术有限公司 | Method and apparatus for handling data |
CN111563761A (en) * | 2020-01-19 | 2020-08-21 | 深圳前海微众银行股份有限公司 | Crowd expansion method, device, equipment and storage medium |
CN111563761B (en) * | 2020-01-19 | 2024-06-07 | 深圳前海微众银行股份有限公司 | Crowd expansion method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Reddy et al. | Content-based movie recommendation system using genre correlation | |
CN109902849B (en) | User behavior prediction method and device, and behavior prediction model training method and device | |
US9934515B1 (en) | Content recommendation system using a neural network language model | |
CN110532479A (en) | A kind of information recommendation method, device and equipment | |
Cleger-Tamayo et al. | Top-N news recommendations in digital newspapers | |
CN111898032B (en) | Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium | |
Farrahi et al. | A probabilistic approach to mining mobile phone data sequences | |
US9864951B1 (en) | Randomized latent feature learning | |
CN104216960A (en) | Method and device for recommending video | |
CN104462594A (en) | Method and device for providing user personalized resource message pushing | |
CN106168980A (en) | Multimedia resource recommends sort method and device | |
CN104217030A (en) | Method and device for classifying users according to search log data of server | |
Xiao et al. | A time-sensitive personalized recommendation method based on probabilistic matrix factorization technique | |
CN110032678A (en) | Service resources method for pushing and device, storage medium and electronic device | |
CN108182600A (en) | A kind of method and system that extending user is determined according to weighted calculation | |
Zhong et al. | Design of a personalized recommendation system for learning resources based on collaborative filtering | |
Dong et al. | Improving sequential recommendation with attribute-augmented graph neural networks | |
CN108038739A (en) | A kind of method and system that extending user is determined according to the statistics degree of association | |
CN108053260A (en) | A kind of method and system that extending user is determined according to statistics interest-degree | |
CN107943932B (en) | Item recommendation method, storage device and terminal | |
Xu et al. | Predicting smartphone app usage with recurrent neural networks | |
CN113704620A (en) | User label updating method, device, equipment and medium based on artificial intelligence | |
CN116823410A (en) | Data processing method, object processing method, recommending method and computing device | |
CN108182235A (en) | A kind of method and system for being used to carry out user characteristics distributed coding | |
Lee et al. | A study on the context-aware hybrid bayesian recommender system on the mobile devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180515 |