CN108256548A - User-portrait depiction method and system based on Emoji usage - Google Patents

User-portrait depiction method and system based on Emoji usage

Info

Publication number
CN108256548A
CN108256548A (application CN201711261393.1A)
Authority
CN
China
Prior art keywords
emoji
user
portrait
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711261393.1A
Other languages
Chinese (zh)
Inventor
刘譞哲
陈震鹏
陆璇
马郓
黄罡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201711261393.1A
Publication of CN108256548A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The present invention provides a user-portrait depiction method and system based on Emoji usage. The steps of the method include: obtaining the portrait information of a batch of users and the text data they have input; extracting the raw Emoji data from the text data using a regular expression; deriving the users' Emoji usage features from the raw Emoji data; dividing the users into a training set and a test set; training models with the Emoji usage features of the users in the training set as the independent variables and their portrait information as the dependent variable; and applying the trained models to the test set and selecting the model with the best evaluation index as the final user-portrait depiction model. The invention depicts user portraits from the Emoji in user text, without analyzing sensitive content in the users' input, and can thus protect user privacy.

Description

User-portrait depiction method and system based on Emoji usage
Technical field
The invention belongs to the field of software technology, and specifically relates to a user-portrait depiction method and system based on Emoji usage.
Background technology
In the big-data era, user-portrait depiction refers to the process by which an enterprise analyzes user data in order to sketch a labeled model of each user. The core of depicting a user portrait is attaching "labels" to the user and reasonably constructing a virtual representation of the user. A "label" can be any of various pieces of information such as the user's gender, age, or religious belief. User portraits are widely used in industry. On the one hand, they help enterprises build data assets and mine the value of data, and also enable data trading and promote data circulation. On the other hand, user portraits help enterprises gain market insight and estimate market size, so as to comprehensively understand user demand and achieve accurate positioning and precision marketing of products. For example, an online advertising platform can use user portraits to recommend the advertisements that best suit each user's preferences, thereby maximizing the ad click-through rate and increasing advertising revenue.
There are many popular user-portrait depiction techniques, such as analyzing the text a user inputs in an application, analyzing the photos a user uploads, analyzing the user's online nickname, and analyzing the user's interaction patterns on web pages and in applications. Among these, most enterprises still depict user portraits by analyzing user-input text. They apply techniques such as natural language processing and information retrieval to the text and extract informative, discriminative text features. Then, taking the text features of users with known personal attributes as the independent variables and those attributes as the dependent variables, supervised or semi-supervised learning is performed with machine-learning or deep-learning methods, finally yielding a model that depicts user portraits. For a new user, the enterprise only needs to extract the corresponding text features and feed them into the user-portrait model, which outputs the predicted user attributes.
However, these mainstream text-analysis techniques for depicting user portraits have drawbacks. First, they damage user privacy to a large degree, because user-input text often contains sensitive content such as addresses, email addresses, telephone numbers, and financial information. Second, they are not universal across languages. Most mainstream text-analysis methods for depicting user portraits are implemented on English text. In recent years, work in the natural language processing field has found that techniques developed for English text are relatively difficult to transfer to other languages, and their depiction quality there is poor. For example, English delimits words with spaces, so techniques such as Bag of Words and unigrams can be applied easily. Languages such as Japanese and Chinese, however, do not segment words with spaces, so extracting text features is cumbersome and less effective. Even when advanced word-segmentation techniques segment the text successfully, the mainstream natural-language-processing techniques still cannot depict user portraits with high quality.
The development and popularization of Emoji bring new thinking to text-based user-portrait depiction. First, Emoji are vivid, free of language barriers, and deeply loved by users in all countries. In 2015, the "face with tears of joy" Emoji was even chosen as the word of the year by the Oxford Dictionaries. Second, as a universal language, Emoji are officially defined by Unicode and can easily be filtered and extracted from text. Compared with depicting user portraits from text content, using Emoji usage has significant advantages. On the one hand, analyzing Emoji does not involve the user's sensitive information, which greatly improves privacy protection. On the other hand, Emoji are simple to extract, requiring no complex word-segmentation techniques, and are widely used in every country. This means that a multinational enterprise no longer needs to design a complex, brand-new user-portrait depiction technique for each language involved; it can instead use Emoji directly to train a single, simple user-portrait depiction technique that suits users in all countries. Therefore, the present invention proposes a user-portrait depiction method based on Emoji usage.
Summary of the invention
In view of the problems with depicting user portraits from text content, the purpose of the present invention is to propose a user-portrait depiction method and system based on Emoji usage, which depicts user portraits from the Emoji in user text without analyzing sensitive content in the users' input, thereby protecting user privacy.
To achieve the above objectives, the technical solution adopted by the present invention is as follows:
A user-portrait depiction method based on Emoji usage, the steps of which include:
obtaining the portrait information of a batch of users and the text data they have input;
extracting the raw Emoji data from the text data using a regular expression;
deriving the users' Emoji usage features from the raw Emoji data;
dividing the users into a training set and a test set;
training models with the Emoji usage features of the users in the training set as the independent variables and their portrait information as the dependent variable;
applying the trained models to the test set and selecting the model with the best evaluation index as the final user-portrait depiction model.
Further, the portrait information includes information such as the user's age and gender.
Further, the regular expression is constructed from the Emoji codes officially defined by Unicode.
Further, the Emoji usage features include Emoji frequency features, Emoji preference features, and Emoji emotion-intent features.
Further, the Emoji frequency features include the ratio of texts that use Emoji, the number of Emoji used in a sentence, and the pattern formed when multiple Emoji are used in one sentence.
Further, the Emoji preference features include the color selection of Emoji, the choice of which Emoji to use, and the choice of which Emoji to use consecutively.
Further, the Emoji emotion-intent features include the usage of Emoji with positive or negative sentiment orientation.
Further, the algorithms used for training include classifier algorithms for predicting categorical attributes such as gender, and regression algorithms for predicting numerical attributes such as age.
Further, the evaluation indexes include indexes suitable for classifier algorithms, such as accuracy, and indexes suitable for regression algorithms, such as mean squared error.
Further, after the evaluation index is determined, parameter selection is performed on each algorithm used, to find the group of parameters that makes the algorithm perform best on the evaluation index. Parameter selection randomly divides the training set into k equal parts and selects parameters by k-fold cross-validation, with the following steps:
for each group of parameters, train the model each time with k-1 of the parts;
test the model with the remaining part, so that training and testing are performed k times;
combine the k performances of each group of parameters and select the group with the best average performance.
A user-portrait depiction system based on Emoji usage, comprising a memory and a processor, wherein the memory stores a computer program configured to be executed by the processor, the program comprising instructions for each step of the above method.
Compared with the prior art, the positive effects of the present invention are:
The method of the present invention mainly uses the Emoji in user text to depict user portraits, without analyzing sensitive content in the users' input text, which greatly protects user privacy. In addition, since Emoji is a universal and very popular language worldwide, the method generalizes well. Using the method, a multinational enterprise whose users span the world no longer needs to design a dedicated user-portrait depiction technique for each complex language.
Description of the drawings
Fig. 1 is a flowchart of the user-portrait depiction method based on Emoji usage of the present invention.
Specific embodiment
To make the above features and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawing.
This embodiment provides a user-portrait depiction method based on Emoji usage which, as shown in Fig. 1, can be divided into the following two major parts.
First, the core of the method is to extract, from each user's Emoji usage, features that are discriminative for the user's portrait. This part mainly includes two steps:
1. Text preprocessing
First, the collected user text data must be cleaned. A regular expression that matches Emoji in text is constructed from the Emoji codes officially defined by Unicode and applied to every text, so as to filter out the text content that is not Emoji, yielding the raw data of the Emoji each user has used.
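The preprocessing step above can be sketched as follows. This is a minimal illustration, assuming a Python environment; the character ranges below cover only a few common Emoji blocks, not the full set of Unicode-defined Emoji codes the text refers to.

```python
import re

# Regular expression built from a few Unicode Emoji code-point ranges.
# Illustrative subset only; the full official Emoji set is much larger.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)

def extract_emoji(text):
    """Return the raw list of Emoji characters found in one text."""
    return EMOJI_PATTERN.findall(text)

print(extract_emoji("great day \U0001F600\U0001F600 at the beach \u2600"))
```

Applying this to every text of a user produces the per-user raw Emoji data from which the features of the next step are built.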
2. Feature definition
Quantitative Emoji usage features usable by machine-learning and deep-learning algorithms are constructed from the raw Emoji data, specifically including:
1) Emoji frequency features
At a coarse granularity, the ratio of texts that use Emoji can be obtained while the Emoji are being filtered out of the text. At a finer granularity, for each sentence one can extract the number of Emoji used in it and the various patterns that appear when a sentence contains multiple Emoji, for example whether several Emoji are used consecutively in one sentence.
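A minimal sketch of these frequency features, assuming a single Emoji block for brevity; the feature names are illustrative, not taken from the patent.

```python
import re

# Illustrative subset: the emoticons block only.
EMOJI_CHARS = "\U0001F600-\U0001F64F"
EMOJI_RE = re.compile("[" + EMOJI_CHARS + "]")
CONSECUTIVE_RE = re.compile("[" + EMOJI_CHARS + "]{2,}")

def frequency_features(sentences):
    """Coarse- and fine-grained Emoji frequency features for one user:
    the ratio of sentences containing Emoji, the mean number of Emoji
    per sentence, and the ratio of sentences that use several Emoji in
    a row (one simple multi-Emoji usage pattern)."""
    n = len(sentences)
    counts = [len(EMOJI_RE.findall(s)) for s in sentences]
    return {
        "emoji_text_ratio": sum(c > 0 for c in counts) / n,
        "mean_emoji_per_sentence": sum(counts) / n,
        "consecutive_ratio": sum(1 for s in sentences
                                 if CONSECUTIVE_RE.search(s)) / n,
    }

print(frequency_features(
    ["so happy \U0001F600\U0001F600", "plain text", "ok \U0001F601"]))
```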
2) Emoji preference features
Emoji preference refers to different users' different choices of Emoji in text; for example, women prefer bright Emoji more than men do. From each user's raw Emoji data, the number of times the user has used each Emoji can be derived as the user's preference features. Going deeper, which Emoji each user most likes to use consecutively can also be considered as a feature.
3) Emoji emotion-intent features
When communicating offline, people enrich their language with facial expressions and body movements so that the other party can understand their intention more easily and accurately. When communicating online, because facial expressions and the like are missing, users turn to Emoji to serve as these cues. Considering that sociology and psychology have found that different types of people use facial expressions differently (for example, women use facial expressions more frequently than men), the present invention also takes the emotion intent of Emoji usage as a feature for distinguishing users. First, sentiment-analysis tools such as LIWC are used to analyze the official Unicode description of each Emoji and obtain its affective label, thereby dividing the Emoji into those with positive, negative, and no sentiment tendency. Finally, for each user, the usage of positive and negative Emoji can be derived as the user's Emoji emotion-intent features.
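The emotion-intent features might be computed as below. The positive and negative labels here are a small hand-written stand-in for the labels the text derives by running sentiment-analysis tools such as LIWC over the Unicode descriptions; the specific Emoji chosen and their labels are assumptions of this sketch.

```python
# Stand-in affective lexicon (assumed labels, not derived from LIWC here).
POSITIVE = {"\U0001F600", "\U0001F60D"}  # grinning face, heart-eyes
NEGATIVE = {"\U0001F622", "\U0001F620"}  # crying face, angry face

def emotion_features(emoji_stream):
    """Share of positive and negative Emoji in one user's Emoji usage."""
    total = len(emoji_stream)
    pos = sum(1 for e in emoji_stream if e in POSITIVE)
    neg = sum(1 for e in emoji_stream if e in NEGATIVE)
    return {"positive_ratio": pos / total, "negative_ratio": neg / total}

print(emotion_features(["\U0001F600", "\U0001F622",
                        "\U0001F600", "\U0001F60D"]))
```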
Second, user-portrait depiction models can be trained on the above Emoji usage features. The specific steps of the training process are as follows:
1. Data-set division
To train the models, a batch of users with known attribute information (such as age and gender) is needed, and the features of their Emoji usage are extracted. These users are divided into a training set and a test set at a certain ratio (such as 5:1). The attribute information and Emoji usage features of the users in the training set are used to train the models. Then the Emoji usage features of the users in the test set are fed into a model, which outputs a predicted attribute for each test user; comparing these predictions with the users' real attributes and computing the chosen evaluation index gives the model's performance.
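The division into training and test sets at the 5:1 example ratio could be sketched as follows (the function name and fixed seed are choices of this sketch, not of the patent):

```python
import random

def split_users(users, ratio=5):
    """Randomly split users into training and test sets at roughly
    ratio:1 (the description's example uses 5:1)."""
    users = list(users)
    random.Random(42).shuffle(users)  # fixed seed for reproducibility
    cut = len(users) * ratio // (ratio + 1)
    return users[:cut], users[cut:]

train, test = split_users(range(600))
print(len(train), len(test))  # 500 100
```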
2. Evaluation-index selection
Before model training, the evaluation index of a good model must be determined. If a classifier algorithm is used to predict categorical attributes such as gender, indexes such as accuracy can be chosen. Accuracy is the ratio of correctly predicted samples to the total number of samples. Of course, different application scenarios may call for different evaluation indexes. For example, for an advertiser specializing in the male market, the target users are male, and what the advertiser cares about is to what degree the algorithm can identify male users; the evaluation index is then the ratio of male users that are predicted correctly. If a regression algorithm is used to predict numerical attributes such as age, mean squared error (MSE) can be chosen as the evaluation index. MSE is the mean of the squared differences between the true and predicted values of the attribute; the larger the MSE, the further the model's predictions are from the true values.
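The two evaluation indexes named above, accuracy for classifiers and mean squared error for regression, can be written out directly:

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly predicted samples (classifiers, e.g. gender)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of squared differences between true and predicted values
    (regression, e.g. age)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy(["m", "f", "m"], ["m", "m", "m"]))  # 2 of 3 correct
print(mse([20, 30], [22, 27]))                     # (4 + 9) / 2 = 6.5
```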
3. Algorithm selection
User-portrait depiction is a multi-dimensional problem: much information, such as the user's gender and age, can be depicted. For categorical attributes such as gender, a classifier algorithm can be chosen; for gender, the algorithm's output is one of the two classes, male or female. For numerical attributes such as age, a regression algorithm can be chosen; for age, the algorithm's output is a numerical value. After the evaluation index is determined, parameter selection is performed for every candidate algorithm, finding the group of parameters that makes each algorithm perform best on the evaluation index. During parameter selection, the training set is randomly divided into k equal parts, and parameters are selected by k-fold cross-validation. Specifically, for every group of parameters, the model is trained each time on k-1 of the parts and tested on the remaining part, so that training and testing are performed k times. For each algorithm, the k performances of each parameter group are combined, and the group with the best average performance is selected. This yields the best parameter group of each algorithm and hence the best model based on that algorithm. These best models are then applied to the test set: the Emoji usage features of the test-set users are fed into each model, the predictions are compared with the users' true portrait information to compute the evaluation index, and the model that performs best on the evaluation index is selected as the final user-portrait model.
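The k-fold parameter-selection loop described above can be sketched generically. Here `train_fn` and `score_fn` stand in for whichever classifier or regression algorithm is being tuned; they, the function name, and the toy demonstration are assumptions of this sketch.

```python
def k_fold_select(train_set, param_grid, train_fn, score_fn, k=5):
    """Pick the parameter setting with the best mean score over k folds.

    The training set is divided into k parts; for every parameter
    setting, the model is trained on k-1 parts and scored on the
    held-out part, k times in total, and the setting with the best
    average score wins."""
    folds = [train_set[i::k] for i in range(k)]
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = []
        for i in range(k):
            held_out = folds[i]
            rest = [x for j, f in enumerate(folds) if j != i for x in f]
            model = train_fn(rest, params)
            scores.append(score_fn(model, held_out))
        mean = sum(scores) / k
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params

# Toy demonstration: "training" just returns the parameter, and the
# score is how close the parameter lies to the held-out fold's mean.
best = k_fold_select(list(range(20)), [0, 10, 20],
                     train_fn=lambda data, p: p,
                     score_fn=lambda m, held: -abs(m - sum(held) / len(held)))
print(best)  # 10, the value closest to the data's overall mean
```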
With the user-portrait model obtained by the above training, the Emoji usage features of a user to be depicted are input into the model, and the output is the model's prediction of that user's portrait.
The above embodiments are merely illustrative of the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it equivalently without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall be defined by the claims.

Claims (10)

  1. A user-portrait depiction method based on Emoji usage, the steps of which include:
    obtaining the portrait information of a batch of users and the text data they have input;
    extracting the raw Emoji data from the text data using a regular expression;
    deriving the users' Emoji usage features from the raw Emoji data;
    dividing the users into a training set and a test set;
    training models with the Emoji usage features of the users in the training set as the independent variables and their portrait information as the dependent variable;
    applying the trained models to the test set and selecting the model with the best evaluation index as the final user-portrait depiction model.
  2. The method according to claim 1, characterized in that the regular expression is constructed from the Emoji codes officially defined by Unicode.
  3. The method according to claim 1, characterized in that the Emoji usage features include Emoji frequency features, Emoji preference features, and Emoji emotion-intent features.
  4. The method according to claim 3, characterized in that the Emoji frequency features include the ratio of texts that use Emoji, the number of Emoji used in a sentence, and the pattern formed when multiple Emoji are used in one sentence.
  5. The method according to claim 3, characterized in that the Emoji preference features include the color selection of Emoji, the choice of which Emoji to use, and the choice of which Emoji to use consecutively.
  6. The method according to claim 3, characterized in that the Emoji emotion-intent features include the usage of Emoji with positive or negative sentiment orientation.
  7. The method according to claim 1, characterized in that the algorithms used for training the models include classifier algorithms for predicting categorical attributes and regression algorithms for predicting numerical attributes, the categorical attributes including gender and the numerical attributes including age.
  8. The method according to claim 7, characterized in that the evaluation indexes include:
    indexes suitable for classifier algorithms, including accuracy;
    indexes suitable for regression algorithms, including mean squared error.
  9. The method according to claim 7 or 8, characterized in that after the evaluation index is determined, parameter selection is performed on each algorithm used, to find the group of parameters that makes the algorithm perform best on the evaluation index; parameter selection randomly divides the training set into k equal parts and selects parameters by k-fold cross-validation, with steps including:
    for each group of parameters, training the model each time with k-1 of the parts;
    testing the model with the remaining part, so that training and testing are performed k times;
    combining the k performances of each group of parameters and selecting the group with the best average performance.
  10. A user-portrait depiction system based on Emoji usage, comprising a memory and a processor, wherein the memory stores a computer program configured to be executed by the processor, the program comprising instructions for each step of the method of any one of claims 1 to 9.
CN201711261393.1A 2017-12-04 2017-12-04 User-portrait depiction method and system based on Emoji usage Pending CN108256548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711261393.1A CN108256548A (en) 2017-12-04 2017-12-04 User-portrait depiction method and system based on Emoji usage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711261393.1A CN108256548A (en) 2017-12-04 2017-12-04 User-portrait depiction method and system based on Emoji usage

Publications (1)

Publication Number Publication Date
CN108256548A true CN108256548A (en) 2018-07-06

Family

ID=62722364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711261393.1A Pending CN108256548A (en) User-portrait depiction method and system based on Emoji usage

Country Status (1)

Country Link
CN (1) CN108256548A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929285A (en) * 2019-12-10 2020-03-27 支付宝(杭州)信息技术有限公司 Method and device for processing private data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101419527A (en) * 2007-10-19 2009-04-29 株式会社理光 Information processing, outputting and forming device, and user property judgement method
CN105160016A (en) * 2015-09-25 2015-12-16 百度在线网络技术(北京)有限公司 Method and device for acquiring user attributes
CN105701074A (en) * 2016-01-04 2016-06-22 北京京东尚科信息技术有限公司 Character processing method and apparatus
WO2016113967A1 (en) * 2015-01-14 2016-07-21 ソニー株式会社 Information processing system, and control method
CN106708983A (en) * 2016-12-09 2017-05-24 竹间智能科技(上海)有限公司 Dialogue interactive information-based user portrait construction system and method
CN107423442A (en) * 2017-08-07 2017-12-01 火烈鸟网络(广州)股份有限公司 Application recommendation method and system based on user-portrait behavior analysis, storage medium, and computer equipment


Similar Documents

Publication Publication Date Title
CN108491377A E-commerce product comprehensive scoring method based on multi-dimensional information fusion
CN108984530A Detection method and detection system for sensitive network content
CN109376251A Microblog Chinese sentiment-dictionary construction method based on a word-vector learning model
CN109829166B Opinion mining method for homestay hosts and customers based on a character-level convolutional neural network
CN107391483A Commodity-review sentiment classification method based on convolutional neural networks
CN108364199B Data analysis method and system based on Internet user comments
CN105205699A Method and device for matching user labels with hotel labels based on hotel comments
CN108388660B Improved e-commerce product pain-point analysis method
CN107657056B Method and device for displaying comment information based on artificial intelligence
CN107862087A Sentiment analysis method and apparatus based on big data and deep learning, and storage medium
CN106815194A Model training method and device, and keyword recognition method and device
CN106407235B Semantic-dictionary construction method based on comment data
CN107391575A Implicit-feature recognition method based on a word-vector model
CN108319734A Automatic construction method for product-feature structure trees based on linear combiners
KR20120109943A Emotion classification method for analysis of emotion immanent in sentences
CN108108468A Short-text sentiment analysis method and apparatus based on concepts and text emotion
CN109033166B Construction method for character-attribute extraction training data sets
CN111666761A Fine-grained sentiment analysis model training method and device
CN106909572A Construction method and device for a question-and-answer knowledge base
CN110321549B New-concept mining method based on sequential learning, relation mining, and time-series analysis
CN110569354A Barrage sentiment analysis method and device
CN106569996B Sentiment-orientation analysis method for Chinese microblogs
CN108090099A Text processing method and device
CN115017320A E-commerce text clustering method and system combining bag-of-words and deep learning models
Schmøkel et al. FBAdLibrarian and Pykognition: open science tools for the collection and emotion detection of images in Facebook political ads with computer vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180706