Summary of the Invention
To solve the problems of the prior art, embodiments of the present invention provide a method and a device for identifying user attributes. The technical solution is as follows:
In one aspect, an embodiment of the present invention provides a method for identifying user attributes, the method comprising:
obtaining a first sample user set, the first sample user set comprising users who have registered on a platform and whose attribute information is stored;
obtaining a first play record set of the users in the first sample user set, the first play record set comprising information on the multimedia files played by the users;
screening the first sample user set and the first play record set to obtain a second sample user set and a second play record set;
generating a feature matrix based on the second sample user set and the second play record set, the feature matrix comprising a feature vector for each user in the second sample user set, the feature vector of each user being generated from the information on the multimedia files played by that user;
building a classification model based on the feature vectors in the feature matrix and the attribute information corresponding to the feature vectors;
generating a feature vector of a user to be identified from the play records of the user to be identified;
inputting the feature vector of the user to be identified into the classification model, and outputting the user attribute of the user to be identified.
Optionally, obtaining the first play record set of the users in the first sample user set comprises:
obtaining information on the multimedia files played by each user in the first sample user set within a preset time period.
Optionally, screening the first sample user set and the first play record set to obtain the second sample user set and the second play record set comprises:
removing from the first sample user set the users whose number of multimedia files played within the preset time period is less than a first preset threshold, to obtain the second sample user set;
removing from the first play record set the multimedia files whose number of playing users within the preset time period is less than a second preset threshold, to obtain the second play record set.
Optionally, generating the feature matrix based on the second sample user set and the second play record set comprises:
for any user in the second sample user set, counting the term frequency and inverse document frequency, in the second play record set, of each multimedia file played by the user;
generating a vector element for each multimedia file from the term frequency and inverse document frequency counted for the user;
combining the vector elements of the multimedia files to obtain the play score vector of the user;
combining the play score vectors of the users in the second sample user set to obtain a play score matrix;
reducing the dimensionality of the play score matrix, sorting by the characteristic values obtained in the reduction in descending order, and selecting the first preset number of leading vectors to form the feature matrix.
Optionally, building the classification model based on the feature vectors in the feature matrix and the attribute information corresponding to the feature vectors comprises:
training on first feature vectors in the feature matrix and the attribute information of the first feature vectors to generate an initial classification model, the first feature vectors being the leading feature vectors in the feature matrix, a second preset number of them;
verifying and adjusting the initial classification model based on second feature vectors in the feature matrix and the attribute information of the second feature vectors to obtain the classification model, the second feature vectors being the feature vectors in the feature matrix other than the first feature vectors.
In another aspect, an embodiment of the present invention provides a device for identifying user attributes, the device comprising:
a user set acquisition module, configured to obtain a first sample user set, the first sample user set comprising users who have registered on a platform and whose attribute information is stored;
a play set acquisition module, configured to obtain a first play record set of the users in the first sample user set, the first play record set comprising information on the multimedia files played by the users;
a screening module, configured to screen the first sample user set and the first play record set to obtain a second sample user set and a second play record set;
a matrix generation module, configured to generate a feature matrix based on the second sample user set and the second play record set, the feature matrix comprising a feature vector for each user in the second sample user set, the feature vector of each user being generated from the information on the multimedia files played by that user;
a modeling module, configured to build a classification model based on the feature vectors in the feature matrix and the attribute information corresponding to the feature vectors;
a vector generation module, configured to generate a feature vector of a user to be identified from the play records of the user to be identified;
an identification module, configured to input the feature vector of the user to be identified into the classification model and output the user attribute of the user to be identified.
Optionally, the play set acquisition module is configured to obtain information on the multimedia files played by each user in the first sample user set within a preset time period.
Optionally, the screening module is configured to remove from the first sample user set the users whose number of multimedia files played within the preset time period is less than a first preset threshold, to obtain the second sample user set; and to remove from the first play record set the multimedia files whose number of playing users within the preset time period is less than a second preset threshold, to obtain the second play record set.
Optionally, the matrix generation module is configured to: for any user in the second sample user set, count the term frequency and inverse document frequency, in the second play record set, of each multimedia file played by the user; generate a vector element for each multimedia file from the term frequency and inverse document frequency counted for the user; combine the vector elements of the multimedia files to obtain the play score vector of the user; combine the play score vectors of the users in the second sample user set to obtain a play score matrix; and reduce the dimensionality of the play score matrix, sort by the characteristic values obtained in the reduction in descending order, and select the first preset number of leading vectors to form the feature matrix.
Optionally, the modeling module is configured to: train on first feature vectors in the feature matrix and the attribute information of the first feature vectors to generate an initial classification model, the first feature vectors being the leading feature vectors in the feature matrix, a second preset number of them; and verify and adjust the initial classification model based on second feature vectors in the feature matrix and the attribute information of the second feature vectors to obtain the classification model, the second feature vectors being the feature vectors in the feature matrix other than the first feature vectors.
The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows:
By modeling the multimedia file play records of users who have left attribute information, a classification model for attribute identification is obtained. Attribute information such as the gender and age of a user to be identified can then be predicted from that user's historical play records, providing a basis for user services and improving the accuracy of user services such as multimedia recommendation.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for identifying user attributes provided by an embodiment of the present invention. Referring to Fig. 1, the method includes:
101. Obtain a first sample user set, the first sample user set comprising users who have registered on a platform and whose attribute information is stored.
102. Obtain a first play record set of the users in the first sample user set, the first play record set comprising information on the multimedia files played by the users.
103. Screen the first sample user set and the first play record set to obtain a second sample user set and a second play record set.
104. Generate a feature matrix based on the second sample user set and the second play record set, the feature matrix comprising a feature vector for each user in the second sample user set, the feature vector of each user being generated from the information on the multimedia files played by that user.
105. Build a classification model based on the feature vectors in the feature matrix and the attribute information corresponding to the feature vectors.
106. Generate a feature vector of a user to be identified from the play records of the user to be identified.
107. Input the feature vector of the user to be identified into the classification model, and output the user attribute of the user to be identified.
Optionally, obtaining the first play record set of the users in the first sample user set comprises:
obtaining information on the multimedia files played by each user in the first sample user set within a preset time period.
Optionally, screening the first sample user set and the first play record set to obtain the second sample user set and the second play record set comprises:
removing from the first sample user set the users whose number of multimedia files played within the preset time period is less than a first preset threshold, to obtain the second sample user set;
removing from the first play record set the multimedia files whose number of playing users within the preset time period is less than a second preset threshold, to obtain the second play record set.
Optionally, generating the feature matrix based on the second sample user set and the second play record set comprises:
for any user in the second sample user set, counting the term frequency and inverse document frequency, in the second play record set, of each multimedia file played by the user;
generating a vector element for each multimedia file from the term frequency and inverse document frequency counted for the user;
combining the vector elements of the multimedia files to obtain the play score vector of the user;
combining the play score vectors of the users in the second sample user set to obtain a play score matrix;
reducing the dimensionality of the play score matrix, sorting by the characteristic values obtained in the reduction in descending order, and selecting the first preset number of leading vectors to form the feature matrix.
Optionally, building the classification model based on the feature vectors in the feature matrix and the attribute information corresponding to the feature vectors comprises:
training on first feature vectors in the feature matrix and the attribute information of the first feature vectors to generate an initial classification model, the first feature vectors being the leading feature vectors in the feature matrix, a second preset number of them;
verifying and adjusting the initial classification model based on second feature vectors in the feature matrix and the attribute information of the second feature vectors to obtain the classification model, the second feature vectors being the feature vectors in the feature matrix other than the first feature vectors.
Any combination of the above optional technical solutions may be adopted to form optional embodiments of the present disclosure, which are not described one by one here.
Fig. 2 is a flow chart of a method for identifying user attributes provided by an embodiment of the present invention. Referring to Fig. 2, the embodiment specifically includes:
201. Obtain a first sample user set, the first sample user set comprising users who have registered on a platform and whose attribute information is stored.
Before user attribute identification is performed, users with attribute information can be obtained from the user profile database corresponding to the platform. The attribute information may be the user's gender, age, and so on, and the current identification target determines which users are obtained. If the target is user gender, users who have entered their gender can be obtained from the user profile database. If the target is user age, users who have entered their age can be obtained. When the target is the two attributes of gender and age, users who have entered both can be obtained. The same acquisition approach applies to other types of user attributes, which the embodiment of the present invention does not describe further.
It should be noted that the platform may be an instant messaging platform, a social networking application platform, a multimedia service platform, or another platform that provides information services, which the embodiment of the present invention does not limit.
Of course, to ensure that the sample is comprehensive, the first sample user set can include a certain number of users. For example, 300,000 users can be obtained as the first sample user set. This number is only an example and does not limit the number actually used in the embodiment of the present invention.
202. Obtain a first play record set of the users in the first sample user set, the first play record set comprising information on the multimedia files played by the users.
Optionally, step 202 may include: obtaining, from a user history play record database, information on the multimedia files played by each user in the first sample user set within a preset time period. It should be noted that a preset time period needs to be set at acquisition to avoid the influence of outdated play records on the current modeling process, so that newer play records are used for user attribute identification. The preset time period may be the most recent 6 months, the most recent 3 months, or another period that is representative of the user's actual play behavior.
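The time-window restriction in step 202 can be illustrated with a minimal Python sketch. The record layout `(user_id, file_id, played_at)`, the function name `records_in_window`, and the 180-day default are assumptions made for illustration only; the embodiment does not prescribe a storage format.

```python
from datetime import datetime, timedelta

def records_in_window(play_log, now, window_days=180):
    """Keep (user_id, file_id, played_at) records from the last `window_days` days.

    The tuple layout and the 180-day default are illustrative assumptions.
    """
    cutoff = now - timedelta(days=window_days)
    return [rec for rec in play_log if cutoff <= rec[2] <= now]

# Hypothetical play log for two users.
now = datetime(2024, 7, 1)
play_log = [
    ("u1", "s1", datetime(2024, 6, 20)),
    ("u1", "s2", datetime(2023, 1, 5)),   # older than the window, dropped
    ("u2", "s1", datetime(2024, 3, 15)),
]
first_play_records = records_in_window(play_log, now)
```

Only the records inside the window feed the later screening and modeling steps, which keeps outdated listening habits out of the model.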
203. Remove from the first sample user set the users whose number of multimedia files played within the preset time period is less than the first preset threshold, to obtain the second sample user set.
For a user, if the number of multimedia files the user played within the preset time period is less than the first preset threshold, the play records cannot accurately measure the user's play preference, so such users need to be screened out. For example, users who played no more than 15 songs within half a year are removed from the first sample user set.
The second sample user set obtained by screening is denoted $\{u_j \mid j = 1, 2, \ldots, n, n+1, n+2, \ldots, n+m\}$. The first n users in the set can serve as training set users, and the last m users can serve as validation set users.
204. Remove from the first play record set the multimedia files whose number of playing users within the preset time period is less than the second preset threshold, to obtain the second play record set.
Likewise, for a multimedia file, if the number of users who played it within the preset time period is less than the second preset threshold, its value for measuring users' play preferences is relatively small, so such multimedia files need to be screened out. For example, multimedia files with fewer than 6 playing users within the half year are removed from the above half-year play records. The multimedia files that remain after processing can serve as the multimedia file dictionary, denoted $\{S_j \mid j = 1, 2, \ldots, k\}$, where k is the number of multimedia files.
Steps 203 and 204 constitute the process of screening the first sample user set and the first play record set to obtain the second sample user set and the second play record set. The first preset threshold and the second preset threshold used in screening can be adjusted according to the actual scenario, which the embodiment of the present invention does not specifically limit.
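The two screening steps can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the function name `screen`, the `(user_id, file_id)` record layout, and the choice to count playing users over the first play record set before users are removed are all assumptions.

```python
from collections import defaultdict

def screen(records, min_files_per_user, min_users_per_file):
    """records: list of (user_id, file_id) plays inside the preset time period."""
    files_by_user = defaultdict(set)
    users_by_file = defaultdict(set)
    for user, item in records:
        files_by_user[user].add(item)
        users_by_file[item].add(user)

    # Step 203: keep users who played at least the first threshold of distinct files.
    kept_users = {u for u, files in files_by_user.items()
                  if len(files) >= min_files_per_user}
    # Step 204: keep files played by at least the second threshold of users
    # (counted over the first play record set, an assumption of this sketch).
    kept_files = {f for f, users in users_by_file.items()
                  if len(users) >= min_users_per_file}

    second_records = [(u, f) for u, f in records
                      if u in kept_users and f in kept_files]
    return kept_users, kept_files, second_records

# Hypothetical toy data with thresholds of 2 and 2.
records = [("a", "s1"), ("a", "s2"), ("b", "s1"), ("c", "s3")]
users, files, second_records = screen(records, 2, 2)
```

Both thresholds are parameters, consistent with the statement that they can be tuned to the actual scenario.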
205. Generate a feature matrix based on the second sample user set and the second play record set, the feature matrix comprising a feature vector for each user in the second sample user set, the feature vector of each user being generated from the information on the multimedia files played by that user.
Specifically, step 205 comprises the following process:
205A. For each user in the second sample user set, count the term frequency and inverse document frequency, in the second play record set, of each multimedia file played by the user.
Specifically, the number of times $f_{ij}$ that each user $u_i$ played multimedia file $S_j$ is counted, and the natural logarithm of the play count is taken, $tf_{ij} = \ln(f_{ij})$, to obtain the term frequency of each multimedia file. Then the inverse document frequency $idf_j$ of each multimedia file is calculated, where $n_j$ denotes the number of users who played multimedia file $S_j$.
205B. Generate a vector element for each multimedia file from the term frequency and inverse document frequency counted for the user.
Multiplying the term frequency by the inverse document frequency, $w_{ij} = tf_{ij} \cdot idf_j$, gives the vector elements of the user's multimedia files, $d_i = (w_{i1}, w_{i2}, \ldots, w_{ik})$, where $i = 1, 2, \ldots, n+m$ and $j = 1, 2, \ldots, k$.
205C. Combine the vector elements of the multimedia files to obtain the play score vector of the user.
The play score vector of the user is obtained by combining the vector elements of the multiple multimedia files played by the user.
Steps 205A-205C are in effect the process of obtaining each user's play score vector based on the tf-idf (Term Frequency-Inverse Document Frequency) model, and the play score vector can be a tf-idf vector. In this process, each user is treated as a document and each multimedia file played by the user is treated as a word in the document, so that a k-dimensional vector is generated for the user.
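Steps 205A-205C can be sketched in Python as follows. The term frequency follows the stated $tf_{ij} = \ln(f_{ij})$. The inverse document frequency formula is not reproduced in the text, so this sketch assumes the common form $idf_j = \ln(N / n_j)$ with $N$ the number of users; that form, the function name `tfidf_vectors`, and the record layout are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(records, users, files):
    """records: list of (user_id, file_id) plays; users and files: ordered lists."""
    counts = Counter(records)                        # f_ij: plays of file j by user i
    players = Counter(f for _, f in set(records))    # n_j: distinct users per file
    n_users = len(users)
    vectors = {}
    for u in users:
        row = []
        for f in files:
            f_ij = counts.get((u, f), 0)
            tf = math.log(f_ij) if f_ij > 0 else 0.0           # tf_ij = ln(f_ij)
            # Assumed idf form ln(N / n_j); the source does not give the formula.
            idf = math.log(n_users / players[f]) if players[f] else 0.0
            row.append(tf * idf)                               # w_ij = tf_ij * idf_j
        vectors[u] = row                                       # d_i = (w_i1, ..., w_ik)
    return vectors

# Hypothetical toy data: user "a" plays s1 three times and s2 once, "b" plays s2 twice.
records = [("a", "s1")] * 3 + [("a", "s2")] + [("b", "s2")] * 2
vectors = tfidf_vectors(records, ["a", "b"], ["s1", "s2"])
```

Note that with $tf_{ij} = \ln(f_{ij})$ a file played exactly once receives weight zero, and under the assumed idf a file played by every user also receives weight zero.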
205D. Combine the play score vectors of the users in the second sample user set to obtain a play score matrix.
The play score vector of each user can be used as a column vector to form the play score matrix. It can of course also be used as a row vector to form the play score matrix, which the embodiment of the present invention does not limit.
205E. Reduce the dimensionality of the play score matrix, sort by the characteristic values obtained in the reduction in descending order, and select the first preset number of leading vectors to form the feature matrix.
It should be noted that the dimension of the play score matrix equals the number of multimedia files in the second play record set and is therefore very large. For example, the second play record set may contain 110,000 multimedia files, that is, the user-multimedia-file play score matrix formed from the play score vectors of the m+n users has up to 110,000 dimensions and a sparsity above 99.5%, so the matrix needs dimensionality reduction. Optionally, the embodiment of the present invention can use the SVD algorithm to reduce the dimensionality of the matrix.
The SVD (Singular Value Decomposition) algorithm factorizes an $n \times m$ matrix $M$ as $M_{n \times m} = U_{n \times n} S_{n \times m} V^{T}_{m \times m}$, where the diagonal of $S$ holds the singular values (the characteristic values referred to above), arranged in descending order. Because $V_{m \times m}$ is an orthogonal basis, $V^{T}_{m \times m} V_{m \times m} = I$, and therefore $M_{n \times m} V_{m \times m} = U_{n \times n} S_{n \times m}$. Keeping the first $r$ dimensions gives $M_{n \times m} V_{m \times r} \approx U_{n \times r} S_{r \times r}$, and $U_{n \times r} S_{r \times r}$ is the low-dimensional space vector after dimensionality reduction. Accordingly, the play score matrix $M_{(n+m) \times k}$ formed from the play score vectors of the above $m+n$ users is decomposed, the characteristic values obtained are sorted in descending order, and the first preset number of leading vectors are extracted to form the feature matrix; that is, the first $r$ dimensions $U_{(n+m) \times r} S_{r \times r}$ are extracted. Optionally, $r$ can be 200 to 500, for example $r = 300$.
The reduction can be regarded as mapping a matrix from the $k$-dimensional space to an $r$-dimensional low-dimensional space ($k \gg r$). The embodiment of the present invention can also use other dimensionality reduction methods for this process, such as latent semantic extraction techniques including PLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation), which are not described further here.
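The SVD-based reduction can be sketched with NumPy as follows. This is a minimal sketch on a small dense random matrix; the function names `fit_svd` and `project` are illustrative, and a matrix of the size described (about 110,000 columns, over 99.5% sparse) would in practice call for a sparse or truncated SVD routine rather than the dense `numpy.linalg.svd` used here.

```python
import numpy as np

def fit_svd(M, r):
    """Return the r-dimensional user features U_r S_r and the projection basis V_r."""
    # Singular values in s are returned in descending order.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    V_r = Vt[:r].T                 # k x r basis of the low-dimensional space
    features = U[:, :r] * s[:r]    # (n+m) x r, the same as M @ V_r
    return features, V_r

def project(A, V_r):
    """Map new play score vectors (rows of A) into the same r-dimensional space."""
    return A @ V_r

# Hypothetical play score matrix: 6 users (rows) by 10 multimedia files (columns).
rng = np.random.default_rng(0)
M = rng.random((6, 10))
features, V_r = fit_svd(M, r=3)
```

Keeping `V_r` matters because the play score vector of a user to be identified must be projected with the same basis before it is passed to the classification model.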
206. Train on the first feature vectors in the feature matrix and the attribute information of the first feature vectors to generate an initial classification model, the first feature vectors being the leading feature vectors in the feature matrix, a second preset number of them.
The model training process is in effect the process of building a classifier with a regression algorithm. For a binary classification problem such as gender, the logistic regression algorithm is used for classification. Logistic regression is an easily understood model, equivalent to y = f(x), that expresses the relationship between the independent variable x and the dependent variable y. A familiar analogy is a doctor's consultation: the doctor looks, listens, asks, and takes the pulse, and then judges whether the patient is ill and with what. The four observations are the independent variable x, that is, the feature data, and the judgment of whether the patient is ill is the dependent variable y, that is, the predicted class. In step 206, the first n vectors of the reduced matrix and their attribute labels (such as gender or age) can be used to train an initial classification model.
207. Verify and adjust the initial classification model based on the second feature vectors in the feature matrix and the attribute information of the second feature vectors to obtain the classification model, the second feature vectors being the feature vectors in the feature matrix other than the first feature vectors.
To verify the prediction accuracy of the classification model, the last m vectors of the reduced matrix and their attribute labels (such as gender or age) can be used to verify and adjust it, outputting a confusion matrix. The classification model is finally obtained based on the confusion matrix.
Steps 206-207 constitute the process of building the classification model based on the feature vectors in the feature matrix and the attribute information corresponding to the feature vectors.
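The train-then-validate procedure of steps 206-207 can be sketched in plain Python for a binary attribute. The text names logistic regression and a confusion matrix but not the training procedure, so batch gradient descent, the learning rate, the epoch count, the toy features, and all function names below are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the log loss; X is a list of rows, y holds 0/1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

def confusion_matrix(w, b, X, y):
    """cm[actual][predicted] counts for the two classes."""
    cm = [[0, 0], [0, 0]]
    for xi, yi in zip(X, y):
        cm[yi][predict(w, b, xi)] += 1
    return cm

# Hypothetical reduced feature vectors: first n rows train, remaining m rows validate.
features = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3], [-2.0, 0.0],
            [-1.9, 0.2], [-2.1, -0.1], [1.9, 0.0], [-2.0, 0.1]]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
n = 6
w, b = train_logistic(features[:n], labels[:n])
cm = confusion_matrix(w, b, features[n:], labels[n:])
```

The off-diagonal entries of `cm` count misclassified validation users, which is the signal that would drive any adjustment of the initial model.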
208. Generate the feature vector of the user to be identified from the play records of the user to be identified.
The process of generating this feature vector is similar to the process of generating the feature vector of each user in the second sample user set in steps 205A-205E and is not repeated here. Specifically, so that the feature vector of the user to be identified matches the classification model, the play score vector also needs dimensionality reduction. Following the example in step 205, if the play score vector of the user to be identified is $A_{k \times m}$, it can be reduced by computing $A_{k \times m} V_{m \times r}$.
209. Input the feature vector of the user to be identified into the classification model, and output the user attribute of the user to be identified.
The classification model classifies the reduced feature vector to obtain the user attribute of the user to be identified.
In the method provided by the embodiment of the present invention, the multimedia file play records of users who have left attribute information are modeled to obtain a classification model for attribute identification. Attribute information such as the gender and age of a user to be identified can then be predicted from that user's historical play records, providing a basis for user services and improving the accuracy of user services such as multimedia recommendation. Further, by treating multimedia files as text, the idea of text classification is introduced into the multimedia field: each user corresponds to a document, and each multimedia file played by the user corresponds to a word in the document. Demographic attributes such as the user's gender and age are predicted from the user's historical multimedia behavior, realizing prediction of attributes such as gender from users' play behavior in the multimedia field.
Fig. 3 is a structural diagram of a device for identifying user attributes provided by an embodiment of the present invention. Referring to Fig. 3, the device includes:
a user set acquisition module 301, configured to obtain a first sample user set, the first sample user set comprising users who have registered on a platform and whose attribute information is stored;
a play set acquisition module 302, configured to obtain a first play record set of the users in the first sample user set, the first play record set comprising information on the multimedia files played by the users;
a screening module 303, configured to screen the first sample user set and the first play record set to obtain a second sample user set and a second play record set;
a matrix generation module 304, configured to generate a feature matrix based on the second sample user set and the second play record set, the feature matrix comprising a feature vector for each user in the second sample user set, the feature vector of each user being generated from the information on the multimedia files played by that user;
a modeling module 305, configured to build a classification model based on the feature vectors in the feature matrix and the attribute information corresponding to the feature vectors;
a vector generation module 306, configured to generate a feature vector of a user to be identified from the play records of the user to be identified;
an identification module 307, configured to input the feature vector of the user to be identified into the classification model and output the user attribute of the user to be identified.
Optionally, the play set acquisition module 302 is configured to obtain information on the multimedia files played by each user in the first sample user set within a preset time period.
Optionally, the screening module 303 is configured to remove from the first sample user set the users whose number of multimedia files played within the preset time period is less than a first preset threshold, to obtain the second sample user set; and to remove from the first play record set the multimedia files whose number of playing users within the preset time period is less than a second preset threshold, to obtain the second play record set.
Optionally, the matrix generation module 304 is configured to: for any user in the second sample user set, count the term frequency and inverse document frequency, in the second play record set, of each multimedia file played by the user; generate a vector element for each multimedia file from the term frequency and inverse document frequency counted for the user; combine the vector elements of the multimedia files to obtain the play score vector of the user; combine the play score vectors of the users in the second sample user set to obtain a play score matrix; and reduce the dimensionality of the play score matrix, sort by the characteristic values obtained in the reduction in descending order, and select the first preset number of leading vectors to form the feature matrix.
Optionally, the modeling module 305 is configured to: train on first feature vectors in the feature matrix and the attribute information of the first feature vectors to generate an initial classification model, the first feature vectors being the leading feature vectors in the feature matrix, a second preset number of them; and verify and adjust the initial classification model based on second feature vectors in the feature matrix and the attribute information of the second feature vectors to obtain the classification model, the second feature vectors being the feature vectors in the feature matrix other than the first feature vectors.
Any combination of the above optional technical solutions may be adopted to form optional embodiments of the present disclosure, which are not described one by one here.
It should be noted that, when the device for identifying user attributes provided by the above embodiment identifies user attributes, the division into the above functional modules is used only as an example. In practice, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device for identifying user attributes and the method for identifying user attributes provided by the above embodiments belong to the same concept. For the specific implementation process, refer to the method embodiment, which is not repeated here.
Fig. 4 is a block diagram of a device 400 for identifying user attributes according to an exemplary embodiment. For example, the device 400 may be provided as a server. Referring to Fig. 4, the device 400 includes a processing component 422, which further includes one or more processors, and memory resources represented by a memory 432 for storing instructions executable by the processing component 422, such as application programs. The application programs stored in the memory 432 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 422 is configured to execute the instructions so as to perform the above method for identifying user attributes.
The device 400 may also include a power supply component 426 configured to perform power management of the device 400, a wired or wireless network interface 450 configured to connect the device 400 to a network, and an input/output (I/O) interface 458. The device 400 can operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.