CN110111143A - Control method and control device for establishing a mobile terminal user portrait

Info

Publication number: CN110111143A
Application number: CN201910351389.7A
Authority: CN
Inventors: 杨洋; 包林基; 郑纪伟; 王维
Original assignee: Shanghai 2345 Mobile Technology Co Ltd
Current assignee: Shanghai 2345 Mobile Technology Co Ltd
Priority date: 2019-04-28
Filing date: 2019-04-28
Publication date: 2019-08-09

Abstract

The present invention discloses a control method for establishing a mobile terminal user portrait, comprising the following steps: a. establishing a training model N based on one or more items of vectorized labeled data, wherein the vectorized labeled data includes at least the vectorized labeled data of a user longitude-and-latitude label; b. performing portrait prediction on the user data M of K mobile terminal users based on the training model N, and determining a training model N+S+1, wherein K > 1 and S >= 2; c. performing portrait prediction on the user data of one or more mobile terminal users based on the training model N+S+1, and storing the predicted user portraits in HBASE. The invention can establish user portraits efficiently and accurately: it analyzes data such as the apps a user has installed or used, mobile phone device information, and browsing records; screens labeled data out of that data; trains a machine learning model; corrects the results and feeds them back into training; and iterates, which increases the quantity and quality of the training data and improves the coverage and accuracy of the portraits.

Description

A control method and control device for establishing a mobile terminal user portrait

Technical field

The invention belongs to the field of mobile Internet, and in particular relates to a control method and a control device for establishing a mobile terminal user portrait.

Background technique

A user portrait, also known as a user persona, is mainly used to delineate target users and to connect user needs with design direction. Establishing user portraits helps a service provider better understand its users, adjust its business strategy, and improve service quality. User portraits are widely used in many fields.

With the rapid development of the Internet, massive amounts of data of all kinds are being produced, ushering in the era of big data and artificial intelligence, which will fundamentally reshape every industry, user portraits included. However, because Internet scenarios are changeable and the data is updated quickly, large in volume, and complex, no mature method or technology has yet emerged for establishing portraits of Internet users, and of mobile Internet users in particular.

At present, user portraits in the prior art have the following shortcomings. First, portraits are established by inefficient means such as user questionnaires, which cannot scale to massive numbers of Internet users. Moreover, some portrait labels, such as occupation, marital status, education, and income level, are difficult to obtain through measures such as questionnaires. Meanwhile, the labeled data that can be obtained is usually very scarce, and its data environment may be inconsistent, so portraits established for massive numbers of users carry large errors.

The existing technology offers no control method for establishing a mobile terminal user portrait; specifically, it lacks a control method and a control device for establishing a mobile terminal user portrait.

Summary of the invention

In view of the technical defects of the prior art, the object of the present invention is to provide a control method for establishing a mobile terminal user portrait, used to compute the user portraits of one or more mobile terminals, comprising the following steps:

a. establishing a training model N based on one or more items of vectorized labeled data, wherein the vectorized labeled data includes at least the vectorized labeled data of a user longitude-and-latitude label;

b. performing portrait prediction on the user data M of K mobile terminal users based on the training model N, and determining a training model N+S+1, wherein K > 1 and S >= 2;

c. performing portrait prediction on the user data of one or more mobile terminal users based on the training model N+S+1, and storing the predicted user portraits in HBASE.

Preferably, determining the vectorized labeled data of the user longitude-and-latitude label comprises the following steps:

a1: obtaining the longitude-and-latitude cluster label of the user;

a2: obtaining the map classification labels of the user;

a3: vectorizing the map classification labels of the user to determine the vectorized labeled data of the user longitude-and-latitude label.

Preferably, step a1 comprises the following steps:

a11: obtaining the user's longitude and latitude information, and expressing it as (e, f);

a12: clustering the users' (e, f) values according to a clustering algorithm to determine the longitude-and-latitude cluster label L of each user, wherein the clustering algorithm includes at least the k-means algorithm or the DBSCAN algorithm.

Preferably, step a2 comprises the following steps:

a21: obtaining map classification data;

a22: determining, for each of multiple time periods, the point in the map classification data nearest to the user's longitude and latitude (e, f), and using the labels of these multiple points as the map classification labels of the user.

Preferably, step a3 comprises the following steps:

a31: combining the labels of the multiple points in the map classification data, in chronological order, into a behavior trajectory of user location labels;

a32: vectorizing the behavior trajectory of user location labels to determine the vectorized labeled data of the user longitude-and-latitude label.

Preferably, the training model N is established as follows:

i: obtaining the user data of one or more mobile terminal users;

ii: determining, based on feature rules, one or more items of labeled data in the user data of the one or more mobile terminal users;

iii: vectorizing the one or more items of labeled data to determine one or more items of vectorized labeled data;

iv: establishing the training model N based on the one or more items of vectorized labeled data.

Preferably, step i is preceded by a step i': obtaining the user authorization of the mobile terminal user.

Preferably, the feature rules are determined as follows:

ii1: creating a classifier;

ii2: training the classifier;

ii3: obtaining prediction results;

ii4: calculating the accuracy rate and the recall rate.
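By way of a non-limiting illustration, steps ii3 and ii4 can be sketched as follows. The sketch assumes that a single candidate rule ("a user who installed a given app carries a given label") stands in for the classifier, and that "accuracy rate" is meant in the sense of precision; neither assumption is stated in the source, and the app names, labels, and the function `evaluate_rule` are hypothetical.

```python
def evaluate_rule(app_name, label, users):
    """Score the rule 'installed app_name => label' on verified users.

    users is a list of (installed_apps, true_label) pairs (hypothetical data).
    Returns (precision, recall) for the rule.
    """
    predicted = [app_name in apps for apps, _ in users]        # ii3: predictions
    actual = [true_label == label for _, true_label in users]
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0             # ii4
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


users = [
    ({"a_doctor", "news"}, "doctor"),
    ({"a_doctor"}, "doctor"),
    ({"a_doctor", "game"}, "teacher"),
    ({"news"}, "doctor"),
    ({"game"}, "teacher"),
]
precision, recall = evaluate_rule("a_doctor", "doctor", users)
```

A rule with high precision but low recall would label few users but label them reliably, which is one plausible reason to keep it as a source of labeled data.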

Preferably, the vectorization processing includes at least the tf-idf representation.

Preferably, the training model N includes at least any one of the following models:

an LR model;

an NB model;

an ensemble model; or

a regression-related model.

Preferably, step b comprises the following steps:

b1. performing portrait prediction on the user data M of K mobile terminal users based on the training model N, and determining T portrait labels that match the user data M of the K mobile terminal users, wherein K > 1 and T > 1;

b2. determining, based on the T portrait labels, P items of user data M+1 corresponding to the T portrait labels, wherein T > 1 and P >= 1;

b3. determining a training model N+1 based on the P items of user data M+1 and the training model N, and replacing the training model N in step a with the training model N+1, wherein P >= 1;

b4. repeating step a to step c S times, and determining the training model N+S+1.
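By way of a non-limiting illustration, the iteration in steps b1 to b4 can be sketched as a loop that predicts labels, feeds the accepted predictions back as new labeled data, and retrains. The token-count model, the confidence threshold (standing in for the result-correction step), and all names below are assumptions made for the sketch, not the models named in the source.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Count how often each token (e.g. an app name) co-occurs with each label."""
    counts = defaultdict(Counter)
    for tokens, label in labeled:
        for tok in tokens:
            counts[tok][label] += 1
    return counts


def predict(model, tokens):
    """Return (label, confidence); (None, 0.0) when no token is known."""
    votes = Counter()
    for tok in tokens:
        votes.update(model.get(tok, {}))
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    label, n = votes.most_common(1)[0]
    return label, n / total


def iterate(labeled, unlabeled, rounds, threshold=0.8):
    """Run `rounds` predict-and-retrain passes; return the final model and data."""
    labeled, remaining = list(labeled), list(unlabeled)
    model = train(labeled)                              # training model N
    for _ in range(rounds):
        still_unlabeled = []
        for tokens in remaining:
            label, conf = predict(model, tokens)        # b1: portrait prediction
            if label is not None and conf >= threshold:
                labeled.append((tokens, label))         # b2: new labeled data
            else:
                still_unlabeled.append(tokens)
        remaining = still_unlabeled
        model = train(labeled)                          # b3: next training model
    return model, labeled


seed = [(["doctor_app"], "doctor"), (["teacher_app"], "teacher")]
unlabeled = [["doctor_app", "med_news"], ["med_news"], ["teacher_app", "homework_app"]]
model, labeled = iterate(seed, unlabeled, rounds=2)
```

In this toy run the user who only has `med_news` cannot be labeled in the first round, but can be in the second, after `med_news` has been associated with "doctor"; this is the coverage gain the iteration is meant to provide.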

Preferably, the user data further includes the list of names of the apps installed or used on the user's mobile terminal, mobile phone device information, and browser records.

According to another aspect of the present invention, a control device for establishing a mobile terminal user portrait is provided, used to compute the user portraits of one or more mobile terminals, comprising the following devices:

a first processing device: establishing a training model N based on one or more items of vectorized labeled data;

a second processing device: performing portrait prediction on the user data M of K mobile terminal users based on the training model N, and determining a training model N+S+1, wherein K > 1 and S >= 2;

a third processing device: performing portrait prediction on the user data of one or more mobile terminal users based on the training model N+S+1, and storing the predicted user portraits in HBASE.

Preferably, the first processing device comprises:

a first acquisition device: obtaining the longitude-and-latitude cluster label of the user;

a second acquisition device: obtaining the map classification labels of the user;

a first determining device: vectorizing the map classification labels of the user to determine the vectorized labeled data of the user longitude-and-latitude label.

Preferably, the first acquisition device comprises:

a third acquisition device: obtaining the user's longitude and latitude information, and expressing it as (e, f);

a second determining device: clustering the users' (e, f) values according to a clustering algorithm to determine the longitude-and-latitude cluster label L of each user.

Preferably, the second acquisition device comprises:

a fourth acquisition device: obtaining map classification data;

a third determining device: determining, for each of multiple time periods, the point in the map classification data nearest to the user's longitude and latitude (e, f), and using the labels of these multiple points as the map classification labels of the user.

Preferably, the first determining device comprises:

an eighth processing device: combining the labels of the multiple points in the map classification data, in chronological order, into a behavior trajectory of user location labels;

a ninth determining device: vectorizing the behavior trajectory of user location labels to determine the vectorized labeled data of the user longitude-and-latitude label.

Preferably, the control device further comprises:

a fifth acquisition device: obtaining the user data of one or more mobile terminal users;

a fourth determining device: determining, based on feature rules, one or more items of labeled data in the user data of the one or more mobile terminal users;

a fifth determining device: vectorizing the one or more items of labeled data to determine one or more items of vectorized labeled data;

a fourth processing device: establishing the training model N based on the one or more items of vectorized labeled data.

Preferably, the control device further comprises a sixth acquisition device: obtaining the user authorization of the mobile terminal user.

Preferably, the fourth determining device comprises:

a fifth processing device: creating a classifier;

a sixth processing device: training the classifier;

a seventh acquisition device: obtaining prediction results;

a first computing device: calculating the accuracy rate and the recall rate.

Preferably, the second processing device comprises:

a sixth determining device: performing portrait prediction on the user data M of K mobile terminal users based on the training model N, and determining T portrait labels that match the user data M of the K mobile terminal users, wherein K > 1 and T > 1;

a seventh determining device: determining, based on the T portrait labels, P items of user data M+1 corresponding to the T portrait labels, wherein T > 1 and P >= 1;

a seventh processing device: determining a training model N+1 based on the P items of user data M+1 and the training model N, and replacing the training model N in step a with the training model N+1, wherein P >= 1;

an eighth determining device: repeating step a to step c S times, and determining the training model N+S+1.

The present invention discloses a control method for establishing a mobile terminal user portrait, that is, a method for establishing mobile terminal user portraits that can do so efficiently and accurately. It analyzes data such as the apps a user has installed or used, mobile phone device information (brand, model, memory, etc.), and browsing records; screens labeled data out of that data according to certain rules; trains a machine learning model; corrects the results and feeds them back into training; and iterates, which increases the quantity and quality of the training data and improves the coverage and accuracy of the portraits.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1 shows a specific flow diagram of a control method for establishing a mobile terminal user portrait according to a specific embodiment of the invention;

Fig. 2 shows a specific flow diagram of determining the vectorized labeled data of the user longitude-and-latitude label according to the first embodiment of the invention;

Fig. 3 shows a specific flow diagram of obtaining the longitude-and-latitude cluster label of the user according to the second embodiment of the invention;

Fig. 4 shows a specific flow diagram of obtaining the map classification labels of the user according to the third embodiment of the invention;

Fig. 5 shows a specific flow diagram of establishing the training model N according to the fourth embodiment of the invention;

Fig. 6 shows a specific flow diagram of determining the feature rules according to the fifth embodiment of the invention;

Fig. 7 shows a specific flow diagram of performing portrait prediction on the user data M of K mobile terminal users based on the training model N and determining the training model N+S+1 according to the sixth embodiment of the invention;

Fig. 8 shows a module connection diagram of a control device for establishing a mobile terminal user portrait according to another embodiment of the invention;

Fig. 9 shows a module connection diagram of a control device for establishing a mobile terminal user portrait according to the seventh embodiment of the invention; and

Fig. 10 shows a specific flow diagram of vectorizing the map classification labels of the user and determining the vectorized labeled data of the user longitude-and-latitude label according to the eighth embodiment of the invention.

Specific embodiment

To present the technical solution of the invention more clearly, the invention is further described below with reference to the accompanying drawings.

Those skilled in the art will understand that, in view of the deficiencies of the prior art, the invention discloses a method for establishing a mobile terminal user portrait that can establish user portraits efficiently and accurately from data relating to the user's mobile terminal. Specifically, compared with the prior art, the invention uses the vectorized labeled data of the user longitude-and-latitude label to establish user portraits more accurately, and stores the predicted user portraits in HBASE. Machine learning helps people discover more previously unknown rules, and manually revising the rules in turn improves the performance of the machine learning; the two reinforce each other to yield more comprehensive and accurate portraits.

The invention uses machine learning, which reduces manpower requirements and allows deployment in large-scale production. No additional labeled data is needed: the labeled data is obtained from the massive unlabeled data that has to be processed anyway. Because the labeled data is drawn from the unlabeled data, it grows as the unlabeled data grows, and the data environments of the two are consistent; the trained machine learning model therefore avoids the adverse effects of environment shift, and its accuracy is assured.

Fig. 1 shows a specific flow diagram of a control method for establishing a mobile terminal user portrait according to a specific embodiment of the invention. Specifically, the method comprises the following steps:

First, step S101 is entered: a training model N is established based on one or more items of vectorized labeled data, wherein the vectorized labeled data includes at least the vectorized labeled data of the user longitude-and-latitude label. Those skilled in the art will understand that the vectorized labeled data is obtained by vectorizing some of the user's basic data and is then used to establish the training model N; it is further described with reference to Fig. 2 and is not repeated here. Further, data mining is the computational process of discovering patterns in relatively large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and databases. Training data is the data used to build the data mining model during data mining. The selection of training data generally has the following requirements: the data sample should be as large as possible, the data should be diverse, and the sample quality should be high. Besides training data there is also test data, which is used to check the constructed model; it is used only in model testing, to assess the accuracy of the model, and must never be used during model construction, otherwise overfitting will result. Validation data is optional; it is used to assist model construction and may be reused. When the data set is small, methods such as bootstrap can be used to compensate. In the present invention, the training model N includes at least an LR model, an NB model, an ensemble model, or a regression-related model.

Further, the vectorized labeled data includes at least the vectorized labeled data of the user longitude-and-latitude label. How this data is determined is further described with reference to Fig. 2 to Fig. 4 and is not repeated here.

Then step S102 is executed: portrait prediction is performed on the user data M of K mobile terminal users based on the training model N, and the training model N+S+1 is determined, wherein K > 1 and S >= 2. In such an embodiment, this step first performs portrait prediction on the user data M of the K mobile terminal users based on the training model N and determines T portrait labels that match that user data, wherein K > 1 and T > 1; it then determines, based on the T portrait labels, P items of user data M+1 corresponding to the T portrait labels, wherein T > 1 and P >= 1; next it determines a training model N+1 based on the P items of user data M+1 and the training model N, and replaces the training model N in step a with the training model N+1, wherein P >= 1; finally, step a to step c are repeated S times to determine the training model N+S+1. These steps are further described in subsequent embodiments of the invention and are not repeated here.

Finally, portrait prediction is performed on the user data of one or more mobile terminal users based on the training model N+S+1, and the predicted user portraits are stored in HBASE. The training model N+S+1 is the final model: the user data of one or more mobile terminal users is fed to the final model for portrait prediction, and the prediction results are stored in HBASE. HBASE is an unstructured data store; using it allows users to be modified and added more quickly, which suits projects such as user portraits, where queries are few and modifications are many, and in particular the embodiments shown in the present invention.

Fig. 2 shows a specific flow diagram of determining the vectorized labeled data of the user longitude-and-latitude label according to the first embodiment of the invention. Those skilled in the art will understand that, to establish the training model more comprehensively, completely, and accurately, multiple items of vectorized labeled data are preferably used. Some of these are obtained from the list of names of the apps installed or used on the user's mobile terminal, some from the mobile phone device information, and others from browser records; these data are vectorized to obtain the vectorized labeled data, from which the training model is established. In a preferred embodiment, establishing the training model requires the vectorized labeled data of the user longitude-and-latitude label, and Fig. 2 shows how that data is determined. Specifically, this comprises the following steps:

First, step S1011 is entered: the longitude-and-latitude cluster label of the user is obtained. In such an embodiment, the user's longitude and latitude information is first obtained and expressed as (e, f), and the users' (e, f) values are then clustered according to a clustering algorithm to determine the longitude-and-latitude cluster label L of each user. This is further described in the preferred embodiment shown in Fig. 3 and is not repeated here.

Then step S1012 is executed: the map classification labels of the user are obtained. In such an embodiment, map classification data is first obtained; the point in the map classification data nearest to the user's longitude and latitude (e, f) is then determined, and the label of that point is used as a map classification label of the user. This is further described in the preferred embodiment shown in Fig. 4 and is not repeated here.

Finally, step S1013 is entered: the map classification labels of the user are vectorized to determine the vectorized labeled data of the user longitude-and-latitude label.

Fig. 3 shows a specific flow diagram of obtaining the longitude-and-latitude cluster label of the user according to the second embodiment of the invention. Those skilled in the art will understand that the user's longitude and latitude, either current or over a preceding period of time, can be obtained through the GPS function of the user terminal, through the data network, or by other means. Specifically, this comprises the following steps:

First, step S10111 is entered: the user's longitude and latitude information is obtained and expressed as (e, f). In such an embodiment, an integer value K, K > 0, is preferably set, so that the values 0 to K represent all the cluster labels.

Then step S10112 is entered: the users' (e, f) values are clustered according to a clustering algorithm to determine the longitude-and-latitude cluster label L of each user, wherein the clustering algorithm includes at least the k-means algorithm or the DBSCAN algorithm. Further, clustering the users' (e, f) values with an algorithm such as k-means or DBSCAN yields a value L for each user; L is one of the numbers 0 to K and is the user's longitude-and-latitude cluster label. The k-means and DBSCAN algorithms are existing technology and are not described further here.
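By way of a non-limiting illustration, step S10112 can be sketched with a minimal k-means over (e, f) pairs. The sketch makes several simplifying assumptions not found in the source: squared Euclidean distance on raw degrees, deterministic initialization from the first k points, and labels numbered 0 to k-1; the coordinates are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )


def kmeans(points, k, iterations=20):
    """Cluster (e, f) pairs; returns one cluster label per point."""
    centroids = list(points[:k])          # assumption: first k points as seeds
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each user goes to the nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Hypothetical (latitude, longitude) pairs for four users in two regions.
points = [(31.23, 121.47), (39.90, 116.40), (31.20, 121.50), (39.95, 116.35)]
labels = kmeans(points, k=2)
```

A production system would more likely use a library implementation and a distance that accounts for the Earth's curvature; the sketch only shows how each user ends up with a single integer label L.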

Fig. 4 shows a specific flow diagram of obtaining the map classification labels of the user according to the third embodiment of the invention. Those skilled in the art will understand that this comprises the following steps:

First, step S10121 is entered: map classification data is obtained. In such an embodiment, all positions on the map are preferably divided evenly into multiple point data, that is, the distances between neighboring points are almost identical; the granularity may be, for example, one point every 10 km or one point every 1000 km. The set of all these point data is the map classification data. Those skilled in the art will understand that this classification is by distance; in other embodiments, positions may instead be classified by type of place, for example school, hospital, or hotel. This does not affect the specific embodiments of the invention and is not described further here.

Then step S10122 is entered: for each of multiple time periods, the point in the map classification data nearest to the user's longitude and latitude (e, f) is determined, and the labels of these multiple points are used as the map classification labels of the user. In such an embodiment, the map classification data is obtained first. For example, in a preferred embodiment, a lookup shows that position (23.6, 115.6) is a primary school, so the classification label of that position is "school". Publicly available data includes the map data provided by current navigation apps such as Baidu and Gaode. The map classification data is searched for the point nearest to the user's longitude and latitude (e, f), or for a point within a certain distance of it. If the user is at a school in one time period, in a residential building in another, and in a shopping mall in yet another, the user's classification labels are the labels of these multiple points, that is, the labels of the multiple points in the map classification data are used as the map classification labels of the user.
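By way of a non-limiting illustration, the nearest-point lookup in step S10122 can be sketched with a great-circle (haversine) distance. Only the position (23.6, 115.6) and its "school" label come from the description; the other points of interest, the user fixes, and the function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude pairs."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_label(position, pois):
    """Label of the point of interest closest to position."""
    lat, lon = position
    closest = min(pois, key=lambda poi: haversine_km(lat, lon, poi[0], poi[1]))
    return closest[2]


# Map classification data as (latitude, longitude, label) triples.
POIS = [
    (23.600, 115.600, "school"),
    (23.650, 115.700, "residence"),   # hypothetical
    (23.700, 115.500, "mall"),        # hypothetical
]

# One hypothetical position fix per time period.
fixes = [(23.601, 115.601), (23.649, 115.699), (23.699, 115.502)]
labels = [nearest_label(fix, POIS) for fix in fixes]
```

A maximum-distance cutoff could be added to `nearest_label` to match the "within a certain distance" variant mentioned in the description.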

Fig. 10 shows a specific flow diagram of vectorizing the map classification labels of the user and determining the vectorized labeled data of the user longitude-and-latitude label according to the eighth embodiment of the invention. Specifically, this comprises the following steps:

First, step S10131 is entered: the labels of the multiple points in the map classification data are combined, in chronological order, into a behavior trajectory of user location labels. In the third embodiment of the invention, if the user is at a school in one time period, in a residential building in another, and in a shopping mall in yet another, the labels of these multiple points are used as the map classification labels of the user. In the eighth embodiment, these labels are combined in chronological order, that is, the user's locations are integrated in time order to form the user's movement trajectory. Those skilled in the art will understand that features such as the user's occupation and interests can be inferred from the user's behavior trajectory, which helps establish the user portrait more accurately.

Then step S10132 is entered: the behavior trajectory of user location labels is vectorized to determine the vectorized labeled data of the user longitude-and-latitude label. Those skilled in the art will understand that, in a preferred embodiment, the user's city is determined from the IP address and the map classification information (POI) of that city is obtained; the user's location label at a given moment (such as school, hospital, or factory) is then judged from the distance between the user's longitude and latitude at that moment and the POIs. Further, the user's location labels over a period of time are combined in chronological order into a behavior trajectory of user location labels (such as residence -> factory -> dining -> residence); the behavior trajectory is vectorized and then clustered to obtain a cluster label. In addition, the user's longitude and latitude over a period of time are averaged, the location label of the average longitude and latitude is computed, and a cluster label is computed from the longitude and latitude.
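By way of a non-limiting illustration, building and vectorizing a behavior trajectory can be sketched as follows. The description does not say how the trajectory is vectorized; counting transitions between consecutive location labels is an assumption made here, and the timestamps, the transition vocabulary, and the function names are hypothetical.

```python
from collections import Counter


def trajectory(visits):
    """Order (timestamp, label) visits by time and collapse consecutive repeats."""
    ordered = [label for _, label in sorted(visits)]
    path = []
    for label in ordered:
        if not path or path[-1] != label:
            path.append(label)
    return path


def vectorize(path, vocabulary):
    """Count each transition in vocabulary along the path (assumed encoding)."""
    counts = Counter(zip(path, path[1:]))
    return [counts[transition] for transition in vocabulary]


# Hypothetical visits as (hour of day, location label), deliberately unsorted.
visits = [(18, "residence"), (9, "factory"), (7, "residence"), (12, "dining")]
path = trajectory(visits)

# Hypothetical fixed vocabulary of transitions shared by all users.
vocabulary = [
    ("residence", "factory"),
    ("factory", "dining"),
    ("dining", "residence"),
    ("residence", "dining"),
]
vector = vectorize(path, vocabulary)
```

Because every user is encoded against the same vocabulary, the resulting vectors have equal length and can be passed to a clustering algorithm.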

Fig. 5 shows a specific flow diagram of establishing the training model N according to the fourth embodiment of the invention. Those skilled in the art will understand that the training model N is established from multiple items of vectorized labeled data. Specifically, the training model N is established as follows:

First, step S201 is entered: the user authorization of the mobile terminal user is obtained. In such an embodiment, the list of names of the apps installed or used on the user's mobile terminal, the mobile phone device information, and the browser records need to be obtained, and this information can only be obtained with the authorization of the user terminal; therefore, in step S201, the user authorization of the mobile terminal user must be obtained first.

Then step S202 is entered: the user data of one or more mobile terminal users is obtained. The user data includes the list of names of the apps installed or used on the user's mobile terminal, the mobile phone device information, and the browser records. The mobile phone device information is the brand and model of the phone, its memory size, and so on; the browser records are cached information about the web pages the user has browsed recently.

Next, step S203 is entered: one or more items of labeled data in the user data of the one or more mobile terminal users are determined based on feature rules. How the feature rules are determined is further described with reference to Fig. 6 and is not repeated here. The labeled data is the label information in the user data. In a preferred embodiment, if the user data of multiple mobile terminal users contains the app "A doctor", that app is used as labeled data for the broad category "doctor".
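By way of a non-limiting illustration, applying feature rules in step S203 can be sketched as a lookup from app names to labels. The rule table, the user data, and the choice to leave users with conflicting rule hits unlabeled are all assumptions made for the sketch.

```python
# Hypothetical feature rules: app name -> portrait label.
RULES = {"a_doctor": "doctor", "b_teacher": "teacher"}


def label_users(users, rules):
    """Return {user_id: label} for users matched by exactly one label.

    Users matching no rule, or rules with conflicting labels, stay unlabeled
    (an assumption; the source does not say how conflicts are handled).
    """
    labeled = {}
    for user_id, apps in users.items():
        hits = {rules[app] for app in apps if app in rules}
        if len(hits) == 1:
            labeled[user_id] = hits.pop()
    return labeled


users = {
    "u1": ["a_doctor", "news"],
    "u2": ["game"],
    "u3": ["a_doctor", "b_teacher"],
}
labeled = label_users(users, RULES)
```

Only the users labeled this way serve as labeled data; the rest remain unlabeled data to be predicted by the trained model.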

Subsequently, step S204 is entered: the one or more items of labeled data are vectorized to determine one or more items of vectorized labeled data. An item of text data is an article, a paragraph, or a sentence, and is commonly called a document or a text. Text is normally expressed in human language and is a form of stream data, that is, time-series data. For a computer to process text data, the text must be expressed in a form the computer can understand. The vectorization processing includes at least the tf-idf representation.

Those skilled in the art will understand that, besides the tf-idf (term frequency-inverse document frequency) representation, vectorization can also use the one-hot representation and the tf (term frequency) representation. The one-hot representation first extracts the distinct words in the text data set to obtain a vocabulary of size V, and then represents each article as a V-dimensional vector in which a 1 in the d-th dimension means that the d-th word of the vocabulary appears in the article. For example, given a data set, the distinct words are first extracted (ignoring the order in which they appear in the text) to obtain a vocabulary of, say, 7 words, and the text data set is then converted into a one-hot matrix. If the text data set is very large, the vocabulary may contain thousands of words, which makes the text dimension very large; this not only increases computation time but also causes a sparsity problem (most elements of the one-hot matrix are 0). Therefore, words that occur very rarely are usually excluded when building the vocabulary, to reduce the text dimension.
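By way of a non-limiting illustration, the one-hot representation can be sketched as follows. The three documents are hypothetical and were chosen so that the vocabulary has 7 words; the `min_count` parameter is an assumed way of excluding rare words.

```python
def build_vocabulary(documents, min_count=1):
    """Sorted list of distinct words occurring at least min_count times overall."""
    counts = {}
    for doc in documents:
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
    return sorted(word for word, n in counts.items() if n >= min_count)


def one_hot(doc, vocabulary):
    """V-dimensional 0/1 vector: 1 where the vocabulary word appears in doc."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical documents, e.g. tokenized app names per user.
documents = [
    ["news", "video", "game"],
    ["game", "music", "map"],
    ["map", "bank", "shop"],
]
vocabulary = build_vocabulary(documents)
matrix = [one_hot(doc, vocabulary) for doc in documents]

# Dropping words seen fewer than twice shrinks the vocabulary.
reduced = build_vocabulary(documents, min_count=2)
```

Here the full vocabulary has 7 entries and each row of `matrix` has only 3 ones, which already hints at the sparsity problem described above.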

Unlike the one-hot representation, which only records whether a word occurs, the tf (term frequency) representation also records how many times it occurs: each element of the tf matrix is the number of times the corresponding word occurs in the article divided by the total number of words in the article, that is, the frequency of the word in the article. The tf-idf (term frequency-inverse document frequency) representation considers not only how often a word occurs in an article but also how often it occurs across the whole text data set. The main idea of tf-idf is that if a word or phrase occurs with high frequency in one article and rarely in other articles, it is considered to discriminate well between classes. The idf measures how common a word is across all texts and is computed as $\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$, where $|D|$ is the total number of articles and the denominator is the number of articles containing the word $t_i$. To prevent the denominator from being 0, 1 is usually added to it, giving $\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$. The logarithm is used because, if a word occurs in very few texts, $|D|$ divided by that number is very large, and multiplying it directly by tf would let idf dominate the result; taking the logarithm suppresses the influence of idf. The choice of base does not depend on the text data set, because a different base does not change the relative importance of the words; the base is generally taken to be 10.
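By way of a non-limiting illustration, the tf-idf weights described above can be computed as follows, using the smoothed denominator (1 added) and base 10. The documents and the function name `tf_idf` are hypothetical.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    tf  = occurrences of the word in the document / words in the document
    idf = log10(|D| / (1 + number of documents containing the word))
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    vectors = []
    for doc in documents:
        weights = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            idf = math.log10(n_docs / (1 + doc_freq[word]))
            weights[word] = tf * idf
        vectors.append(weights)
    return vectors


# Hypothetical tokenized documents.
documents = [
    ["game", "game", "news", "map"],
    ["news", "map"],
    ["news", "bank"],
    ["news", "shop"],
]
vectors = tf_idf(documents)
```

With this smoothing, a word that occurs in every document (here "news") receives a slightly negative weight, because the denominator exceeds |D|; a word concentrated in one document (here "game") receives the largest weight.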

The above methods all convert text into vectors in a reasonable way, but they suffer from the "semantic gap" problem. Take the one-hot representation as an example: suppose one text contains only the word "mic" and another text contains only the word "microphone". The two articles may then be represented as [0,0,0,0,1,0,0,0] and [1,0,0,0,0,0,0,0]; although the words in the two texts are similar in meaning, the distance obtained between them is large. Some deep learning methods therefore represent each word as an n-dimensional real-valued vector of a form such as [0.792, -0.177, -0.107, 0.109, -0.542, ...]. Each text can then be reduced, in some manner, to an m-dimensional real-valued vector based on these word vectors. For texts represented in this way, the distances calculated by the Euclidean distance or cosine distance method are small for semantically similar words.
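The semantic gap can be made concrete with a short sketch. The two one-hot vectors are those given above; the dense word vectors are made-up values chosen only for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# One-hot vectors from the example above: the two words share no dimension.
one_hot_a = [0, 0, 0, 0, 1, 0, 0, 0]
one_hot_b = [1, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical dense word vectors (assumed values).
dense_mic = [0.79, -0.18, -0.11, 0.11]
dense_microphone = [0.75, -0.20, -0.09, 0.15]
dense_game = [-0.40, 0.62, 0.33, -0.51]

sim_one_hot = cosine_similarity(one_hot_a, one_hot_b)        # orthogonal vectors
sim_synonyms = cosine_similarity(dense_mic, dense_microphone)
sim_unrelated = cosine_similarity(dense_mic, dense_game)
```

Under the one-hot representation any two different words are orthogonal, so their similarity carries no semantic information; with dense vectors, similar words can lie close together.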

Finally, step S205 is entered: training model N is established based on one or more items of vectorized labeled data. Those skilled in the art will understand that machine learning models fall into two major classes, supervised learning and unsupervised learning, according to the type of data they can use. Supervised learning mainly comprises models for classification and models for regression. Classification models include linear classifiers (such as LR), support vector machines (SVM), naive Bayes (NB), k-nearest neighbors (KNN), decision trees (DT) and ensemble models (RF, GBDT, etc.); regression models include linear regression, support vector machines (SVM), k-nearest neighbors (KNN), regression trees (DT) and ensemble models (ExtraTrees, RF, GBDT). Unsupervised learning mainly comprises data clustering (K-means), data dimensionality reduction (PCA) and the like. That is, the training model N includes at least an LR model, an NB model, an ensemble model or a regression-related model.
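As a minimal sketch of establishing a model from vectorized labeled data, the following trains an LR (logistic regression) classifier by stochastic gradient descent. The feature columns, the sample data, the hyperparameters and all function names are assumptions made for illustration; the original text does not specify how the LR model is trained.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log loss with respect to the score
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(model, x):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    weights, bias = model
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical one-hot features: [medical app, game app, news app].
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]  # 1 = "doctor" label, 0 = other

model = train_logistic_regression(X, y)
predictions = [predict(model, x) for x in X]
```

In practice a library implementation of any of the listed model families could be substituted for this hand-written trainer.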

Fig. 6 shows the fifth embodiment of the present invention, a detailed flow diagram of determining the feature rules, which specifically includes the following steps:

First, step S2031 is entered: a classifier is created. The labels of a group of labeled training data indicate the class to which each datum (observation) belongs. From these training data the classification model trains its own parameters and learns a classifier suited to this group of data, which constitutes creating the classifier.

Then, step S2032 is entered: the classifier is trained. The data that have already been labeled are used to build the model and discover rules; when new data need to be classified, they can be given as input to the learned classifier, which makes the judgment (outputs a label). Step S2033 is then entered: a prediction result is obtained. Data that have already been labeled have their labels hidden and are given to the trained model, and the results are compared with the original labels to judge the learning ability of the model.

Finally, step S2034 is entered: accuracy and recall are calculated. Accuracy is the share of all predictions that are correct, i.e. positive samples predicted as positive (TP) and negative samples predicted as negative (TN): accuracy = (TP + TN) / total number of samples. Precision, taking binary classification as an example, is the share of samples predicted as positive that are truly positive, where a positive sample predicted as positive is a TP and a negative sample predicted as positive is an FP: precision = TP / (TP + FP). Recall is the share of positive samples that are predicted correctly, where a positive sample predicted as positive is a TP and a positive sample predicted as negative is an FN: recall = TP / (TP + FN).
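The evaluation in steps S2033 and S2034 can be sketched as follows: the original labels are withheld, the model's predictions are compared with them, and accuracy, precision and recall are computed from the TP, TN, FP and FN counts. The label lists and function names below are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification result."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, tn, fp, fn


def evaluate(actual, predicted, positive=1):
    """Return accuracy, precision and recall as defined in step S2034."""
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical held-back labels and the model's predictions for the same users.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

metrics = evaluate(actual, predicted)
```

For this sample TP = 3, TN = 4, FP = 2 and FN = 1, so the three metrics differ, which shows why accuracy alone may not describe a classifier well.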

Fig. 7 shows the sixth embodiment of the present invention, a detailed flow diagram of performing portrait prediction on the user data M of K mobile terminal users based on training model N and determining training model N+S+1, which specifically includes the following steps:

First, step S1021 is entered: portrait prediction is performed on the user data M of K mobile terminal users based on training model N, and T portrait labels matching the user data M of the K mobile terminal users are determined, where K > 1 and T > 1. In such an embodiment there are multiple mobile terminal users. In a preferred embodiment, based on the training model, the user data of multiple users contain the "B Doctor" APP; that is, the user data M is the "B Doctor" APP, and the portrait label is the broad category "doctor".

Then, step S1022 is entered: P items of user data M+1 corresponding to the T portrait labels are determined based on the T portrait labels, where T > 1 and P >= 1. Continuing the preferred embodiment of step S1021, multiple users are predicted to be doctors in step S1021, and the "C Doctor" APP is found on the terminals of these doctors; the "C Doctor" APP is then the user data M+1.

Next, step S1023 is entered: training model N+1 is determined based on the P items of user data M+1 and training model N, and training model N+1 replaces training model N in step a, where P >= 1. In such an embodiment, the "C Doctor" APP is added to training model N, which gives training model N+1.

Finally, step S1024 is entered: steps a to c are repeated S times and training model N+S+1 is determined. Those skilled in the art will understand that the database is corrected by manual intervention according to the results, and the pre-stored rules are added to or modified. For example, after users are judged to be doctors from the "B Doctor" APP, it is found that most doctors also have the "C Doctor" APP, so the "C Doctor" APP is added to the doctor label; at the same time, correctly predicted user data are selected and added to the training data. These steps are repeated S times until the rules and the model converge; S is preferably 2 to 6.
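The rule-expansion part of this iteration can be sketched as follows. The original text relies on manual intervention to decide which newly found APPs are added to a label; in this sketch that review is replaced by a simple share threshold, which is an assumption, as are the function names, the thresholds and the sample users. Retraining the model on the enlarged training data is not shown.

```python
from collections import Counter


def predict_users(users, rule_apps):
    """Users with at least one APP from the label's rule set receive the label."""
    return {uid for uid, apps in users.items() if apps & rule_apps}


def candidate_apps(users, predicted, rule_apps, min_share=0.6, max_other_share=0.2):
    """APPs common among labeled users but rare among all other users."""
    others = set(users) - predicted
    inside = Counter(app for uid in predicted for app in users[uid])
    outside = Counter(app for uid in others for app in users[uid])
    found = set()
    for app, count in inside.items():
        if app in rule_apps:
            continue
        share_in = count / len(predicted)
        share_out = outside[app] / len(others) if others else 0.0
        if share_in >= min_share and share_out <= max_other_share:
            found.add(app)
    return found


def refine_rules(users, seed_apps, max_rounds=6):
    """Repeat prediction and rule expansion until no new APP is found."""
    rule_apps = set(seed_apps)
    for _ in range(max_rounds):
        predicted = predict_users(users, rule_apps)
        new_apps = candidate_apps(users, predicted, rule_apps)
        if not new_apps:  # rules have converged
            break
        rule_apps |= new_apps
    return rule_apps, predict_users(users, rule_apps)


# Hypothetical installed-APP sets per user.
users = {
    "u1": {"B Doctor", "C Doctor", "Chat"},
    "u2": {"B Doctor", "C Doctor"},
    "u3": {"B Doctor", "Chat"},
    "u4": {"C Doctor", "Chat"},
    "u5": {"Game", "Chat"},
    "u6": {"Game", "Video"},
    "u7": {"Video", "Chat"},
    "u8": {"Game"},
}

doctor_rules, doctors = refine_rules(users, {"B Doctor"})
```

Starting from the "B Doctor" APP alone, the first round adds the "C Doctor" APP to the doctor label, which in turn lets the second round label user u4; the widely installed "Chat" APP is rejected because it is also common among other users.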

Fig. 8 shows another embodiment of the present invention, a module connection diagram of a control device for establishing a mobile terminal user portrait, which includes the following devices:

First processing device 1: establishes training model N based on one or more items of vectorized labeled data. For the working principle of the first processing device 1, refer to step S101 in Fig. 1; it is not repeated here.

Second processing device 2: performs portrait prediction on the user data M of K mobile terminal users based on training model N and determines training model N+S+1, where K > 1 and S >= 2. For the working principle of the second processing device 2, refer to step S102 in Fig. 1; it is not repeated here.

Third processing device 3: performs portrait prediction on the user data of one or more mobile terminal users based on training model N+S+1 and stores the predicted user portraits in HBase. For the working principle of the third processing device 3, refer to step S103 in Fig. 1; it is not repeated here.

Further, the first processing device 1 includes:

First acquisition device 11: obtains the longitude and latitude cluster label of the user. For the working principle of the first acquisition device 11, refer to step S1011 in Fig. 2; it is not repeated here.

Second acquisition device 12: obtains the map classification label of the user. For the working principle of the second acquisition device 12, refer to step S1012 in Fig. 2; it is not repeated here.

First determining device 13: vectorizes the map classification label of the user and determines the vectorized labeled data of the user longitude and latitude label. For the working principle of the first determining device 13, refer to step S1013 in Fig. 2; it is not repeated here.

Further, the first acquisition device 11 includes a third acquisition device 111, which obtains the user's longitude and latitude information and expresses it as (e, f). For the working principle of the third acquisition device 111, refer to step S10111; it is not repeated here.

Further, the first acquisition device 11 also includes a second determining device 112, which clusters the users' (e, f) according to a clustering algorithm and determines the longitude and latitude cluster label L of the user. For the working principle of the second determining device 112, refer to step S10112; it is not repeated here.

Further, the second acquisition device 12 includes a fourth acquisition device 121, which obtains map classification data. For the working principle of the fourth acquisition device 121, refer to step S10121; it is not repeated here.

Further, the second acquisition device 12 also includes a third determining device 122, which determines the point in the map classification data nearest to the user's longitude and latitude (e, f) and takes the label of that point in the map classification data as the map classification label of the user. For the working principle of the third determining device 122, refer to step S10122; it is not repeated here.
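The nearest-point lookup performed by the third determining device 122 can be sketched as follows. The sketch assumes that e is the longitude and f the latitude in degrees, uses the haversine great-circle distance, and works on made-up map points; none of these choices is specified in the original text.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_map_label(e, f, map_points):
    """Return the label of the map point closest to the user position (e, f)."""
    nearest = min(map_points, key=lambda p: haversine_km(e, f, p[0], p[1]))
    return nearest[2]


# Hypothetical map classification data: (longitude, latitude, label).
map_points = [
    (121.4500, 31.2000, "hospital"),
    (121.5000, 31.2400, "office building"),
    (121.4000, 31.1500, "residential area"),
]

label_a = nearest_map_label(121.4510, 31.2010, map_points)
label_b = nearest_map_label(121.4990, 31.2390, map_points)
```

A linear scan is adequate for a small set of map points; a spatial index would be a natural replacement for large map classification data sets.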

Further, the second processing device 2 includes:

Sixth determining device 21: performs portrait prediction on the user data M of K mobile terminal users based on training model N and determines T portrait labels matching the user data M of the K mobile terminal users, where K > 1 and T > 1. For the working principle of the sixth determining device 21, refer to step S1021; it is not repeated here.

Further, the second processing device 2 also includes a seventh determining device 22, which determines, based on the T portrait labels, P items of user data M+1 corresponding to the T portrait labels, where T > 1 and P >= 1. For the working principle of the seventh determining device 22, refer to step S1022; it is not repeated here.

Further, the second processing device 2 also includes a seventh processing device 23, which determines training model N+1 based on the P items of user data M+1 and training model N, and replaces training model N in step a with training model N+1, where P >= 1. For the working principle of the seventh processing device 23, refer to step S1023; it is not repeated here.

Further, the second processing device 2 also includes an eighth determining device 24, which repeats steps a to c S times and determines training model N+S+1. For the working principle of the eighth determining device 24, refer to step S1024; it is not repeated here.

Further, the first determining device 13 includes an eighth processing device 131, which combines the labels of multiple points in the map classification data in chronological order to form a behavior track of user location labels. For the working principle of the eighth processing device 131, refer to step S10131; it is not repeated here.

Further, the first determining device 13 also includes a ninth determining device 132, which vectorizes the behavior track of user location labels and determines the vectorized labeled data of the user longitude and latitude label. For the working principle of the ninth determining device 132, refer to step S10132; it is not repeated here.

Fig. 9 shows the seventh embodiment of the present invention, a module connection diagram of a control device for establishing a mobile terminal user portrait. Those skilled in the art will understand that the control device further includes a fifth acquisition device 4, which obtains the user data of one or more mobile terminal users. For the working principle of the fifth acquisition device 4, refer to step S202; it is not repeated here.

Further, the control device also includes a fourth determining device 5, which determines, based on feature rules, one or more items of labeled data in the user data of one or more mobile terminal users. For the working principle of the fourth determining device 5, refer to step S203; it is not repeated here.

Further, the control device also includes a fifth determining device 6, which vectorizes the one or more items of labeled data and determines one or more items of vectorized labeled data. For the working principle of the fifth determining device 6, refer to step S204; it is not repeated here.

Further, the control device also includes a fourth processing device 7, which establishes training model N based on one or more items of vectorized labeled data. For the working principle of the fourth processing device 7, refer to step S205; it is not repeated here.

Further, the control device also includes a sixth acquisition device 8, which obtains the user authorization of the mobile terminal user. For the working principle of the sixth acquisition device 8, refer to step S201; it is not repeated here.

Further, the fourth determining device 5 includes a fifth processing device 51, which creates a classifier. For the working principle of the fifth processing device 51, refer to step S2031; it is not repeated here.

Further, the fourth determining device 5 also includes a sixth processing device 52, which trains the classifier. For the working principle of the sixth processing device 52, refer to step S2032; it is not repeated here.

Further, the fourth determining device 5 also includes a seventh acquisition device 53, which obtains the prediction result. For the working principle of the seventh acquisition device 53, refer to step S2033; it is not repeated here.

Further, the fourth determining device 5 also includes a first computing device 54, which calculates accuracy and recall. For the working principle of the first computing device 54, refer to step S2034; it is not repeated here.

Specific embodiments of the present invention are described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A control method for establishing a mobile terminal user portrait, used to compute user portraits for one or more mobile terminals, characterized by comprising the following steps:

a. establishing training model N based on one or more items of vectorized labeled data, wherein the vectorized labeled data include at least the vectorized labeled data of a user longitude and latitude label;

b. performing portrait prediction on the user data M of K mobile terminal users based on training model N, and determining training model N+S+1, wherein K > 1 and S >= 2;

c. performing portrait prediction on the user data of one or more mobile terminal users based on training model N+S+1, and storing the predicted user portraits in HBase.

2. The control method according to claim 1, characterized in that determining the vectorized labeled data of the user longitude and latitude label comprises the following steps:

a1: obtaining the longitude and latitude cluster label of the user;

a2: obtaining the map classification label of the user;

3. The control method according to claim 2, characterized in that step a1 comprises the following steps:

a12: clustering the users' (e, f) according to a clustering algorithm and determining the longitude and latitude cluster label L of the user, wherein the clustering algorithm includes at least the k-means algorithm or the DBSCAN algorithm.

4. The control method according to claim 3, characterized in that step a2 comprises the following steps:

a21: obtaining map classification data;

a22: determining the multiple points in the map classification data that are nearest to the user's longitude and latitude (e, f) in multiple time periods, and taking the labels of the multiple points in the map classification data as the map classification label of the user.

5. The control method according to claim 4, characterized in that step a3 comprises the following steps:

a31: combining the labels of the multiple points in the map classification data in chronological order to form a behavior track of user location labels;

a32: vectorizing the behavior track of user location labels and determining the vectorized labeled data of the user longitude and latitude label.

6. The control method according to claim 1, characterized in that training model N is established as follows:

i: obtaining the user data of one or more mobile terminal users;

ii: determining, based on feature rules, one or more items of labeled data in the user data of the one or more mobile terminal users;

iii: vectorizing the one or more items of labeled data and determining one or more items of vectorized labeled data;

7. The control method according to claim 6, characterized in that, before step i, the method further comprises step i': obtaining the user authorization of the mobile terminal user.

8. The control method according to claim 6, characterized in that the feature rules are determined as follows:

ii1: creating a classifier;

ii2: training the classifier;

ii3: obtaining a prediction result;

ii4: calculating accuracy and recall.

9. The control method according to claim 6, characterized in that the vectorization includes at least the tf-idf representation.

10. The control method according to claim 6, characterized in that training model N includes at least any one of the following models:

an LR model;

an NB model;

an ensemble model; or

a regression-related model.

11. The control method according to claim 1, characterized in that step b comprises the following steps:

b1. performing portrait prediction on the user data M of K mobile terminal users based on training model N, and determining T portrait labels matching the user data M of the K mobile terminal users, wherein K > 1 and T > 1;

b2. determining, based on the T portrait labels, P items of user data M+1 corresponding to the T portrait labels, wherein T > 1 and P >= 1;

b3. determining training model N+1 based on the P items of user data M+1 and training model N, and replacing training model N in step a with training model N+1, wherein P >= 1;

b4. repeating steps a to c S times and determining training model N+S+1.

12. The control method according to any one of claims 1 to 11, characterized in that the user data further include a list of the names of APPs installed or used on the user's mobile terminal, mobile phone device information and browser records.

13. A control device for establishing a mobile terminal user portrait, used to compute user portraits for one or more mobile terminals, characterized by comprising the following devices:

a first processing device (1): establishing training model N based on one or more items of vectorized labeled data;

a second processing device (2): performing portrait prediction on the user data M of K mobile terminal users based on training model N, and determining training model N+S+1, wherein K > 1 and S >= 2;

a third processing device (3): performing portrait prediction on the user data of one or more mobile terminal users based on training model N+S+1, and storing the predicted user portraits in HBase.

14. The control device according to claim 13, characterized in that the first processing device (1) comprises:

a first acquisition device (11): obtaining the longitude and latitude cluster label of the user;

a second acquisition device (12): obtaining the map classification label of the user;

a first determining device (13): vectorizing the map classification label of the user and determining the vectorized labeled data of the user longitude and latitude label.

15. The control device according to claim 14, characterized in that the first acquisition device (11) comprises:

a third acquisition device (111): obtaining the user's longitude and latitude information and expressing it as (e, f);

a second determining device (112): clustering the users' (e, f) according to a clustering algorithm and determining the longitude and latitude cluster label L of the user.

16. The control device according to claim 15, characterized in that the second acquisition device (12) comprises:

a fourth acquisition device (121): obtaining map classification data;

a third determining device (122): determining the multiple points in the map classification data that are nearest to the user's longitude and latitude (e, f) in multiple time periods, and taking the labels of the multiple points in the map classification data as the map classification label of the user.

17. The control device according to claim 16, characterized in that the first determining device (13) comprises:

an eighth processing device (131): combining the labels of the multiple points in the map classification data in chronological order to form a behavior track of user location labels;

a ninth determining device (132): vectorizing the behavior track of user location labels and determining the vectorized labeled data of the user longitude and latitude label.

18. The control device according to claim 13, characterized by further comprising:

a fifth acquisition device (4): obtaining the user data of one or more mobile terminal users;

a fourth determining device (5): determining, based on feature rules, one or more items of labeled data in the user data of the one or more mobile terminal users;

a fifth determining device (6): vectorizing the one or more items of labeled data and determining one or more items of vectorized labeled data;

a fourth processing device (7): establishing training model N based on the one or more items of vectorized labeled data.

19. The control device according to claim 18, characterized by further comprising a sixth acquisition device (8): obtaining the user authorization of the mobile terminal user.

20. The control device according to claim 18, characterized in that the fourth determining device (5) comprises:

a fifth processing device (51): creating a classifier;

a sixth processing device (52): training the classifier;

a seventh acquisition device (53): obtaining a prediction result;

a first computing device (54): calculating accuracy and recall.

21. The control device according to claim 13, characterized in that the second processing device (2) comprises:

a sixth determining device (21): performing portrait prediction on the user data M of K mobile terminal users based on training model N, and determining T portrait labels matching the user data M of the K mobile terminal users, wherein K > 1 and T > 1;

a seventh determining device (22): determining, based on the T portrait labels, P items of user data M+1 corresponding to the T portrait labels, wherein T > 1 and P >= 1;

a seventh processing device (23): determining training model N+1 based on the P items of user data M+1 and training model N, and replacing training model N in step a with training model N+1, wherein P >= 1;

an eighth determining device (24): repeating steps a to c S times and determining training model N+S+1.