CN105488211A - Method for determining user group based on feature analysis - Google Patents

Method for determining user group based on feature analysis

Info

Publication number
CN105488211A
Authority
CN
China
Prior art keywords
user
colony
node
feature
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510924814.9A
Other languages
Chinese (zh)
Inventor
董政
吴文杰
陈露
李学生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Mo Yun Science And Technology Ltd
Original Assignee
Chengdu Mo Yun Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Mo Yun Science And Technology Ltd filed Critical Chengdu Mo Yun Science And Technology Ltd
Priority to CN201510924814.9A priority Critical patent/CN105488211A/en
Publication of CN105488211A publication Critical patent/CN105488211A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for determining a user group based on feature analysis. The method comprises collecting user information and social content from a social networking site server, analyzing user features, and identifying a specific user group based on the analyzed features. The method can effectively improve the accuracy and timeliness of identifying social groups on the Internet.

Description

Method for determining a user group based on feature analysis
Technical field
The present invention relates to big data, and in particular to a method for determining a user group based on feature analysis.
Background art
With the development of the mobile Internet, the social relationships of everyday life have moved online. This has changed how information is exchanged and altered traditional modes of interpersonal communication, with far-reaching significance for every area of social life. Users can communicate and interact widely, operating on text data by writing, forwarding, bookmarking, and similar means. In a social network there are always some nodes that are relatively tightly connected to each other while their ties to other nodes are relatively sparse; such tightly connected nodes can be classed as one group. Groups, as an important attribute of social networks, pose a considerable challenge for public-opinion monitoring and network supervision. If group relationships are not adequately identified, group interests cannot be recognized and content of interest cannot be recommended; nor can harmful information be discovered in time to maintain a healthy network environment.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a method for determining a user group based on feature analysis, comprising:
collecting user information and social content from a social networking site server, analyzing the features of users, and identifying a specific user group based on the analyzed features.
Preferably, analyzing the features of users and identifying a specific user group based on the analyzed features further comprises:
first, describing the group to be identified and extracting a keyword list, namely the group feature words, according to the group's attributes; second, examining the detected users to find the user nodes that belong to the group; during user behavior filtering, matching the users' personal attributes against the group feature words by regular-expression string matching, and, if a user's personal attributes or user-name text contain these feature words, assigning the user to the group to be identified;
in user behavior filtering, processing the text data produced by users in the social network as follows to calculate the similarity between a user and the group:
first, an N-dimensional vector U based on the group feature words is established, expressed as:
$$U = [T_1, T_2, T_3, \ldots, T_N]$$
where each T is the frequency with which a given feature word occurs in the group, and the subscript, running up to N, indexes the feature words;
second, word segmentation is applied to the full text $P_A$ of user A:
$$P_A = [\mathrm{key}_1, \mathrm{key}_2, \ldots, \mathrm{key}_N],$$
where each key value is the frequency with which the corresponding feature word occurs in the user's session text;
the behavioral features of the user's text data and of the group are then compared for closeness:
$$\mathrm{sim}(A, U) = \frac{P_A \cdot U}{\|P_A\| \, \|U\|}$$
if the similarity sim(A, U) exceeds a predetermined threshold, user node A is assigned to group U;
describing the session process with a data structure; linking the users who took part in the session together by their relationships and building them into a group based on a single event; finally, using node metrics over the social network topology to identify the nodes in the strong relation group, and storing the event to a file in a tree-shaped hierarchical structure; wherein a strong relation group is defined as follows: a known group α is called a strong relation group if every user node i in α has more connections to nodes inside α than to nodes outside α.
Compared with the prior art, the present invention has the following advantage:
The present invention proposes a method for determining a user group based on feature analysis, which effectively improves the accuracy and timeliness of identifying social groups on the Internet.
Brief description of the drawings
Fig. 1 is a flowchart of the method for determining a user group based on feature analysis according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention can be practiced according to the claims without some or all of them.
One aspect of the present invention provides a method for determining a user group based on feature analysis. Fig. 1 is a flowchart of this method according to an embodiment of the present invention.
To carry out group analysis of a social network, a data acquisition system is first set up to collect data from the social networking site server. The data types include user information such as user ID and user name; text data such as session ID and session text; and relationship data such as the following list and the follower list. The system comprises the following modules: user information acquisition, text data acquisition, social relationship generation, de-duplication, multithreading, data storage, priority selection, and batch token acquisition. The master control thread of the data acquisition system performs authorization, program initialization, seed node reading, filtering, and database operations. The data acquisition threads collect data through the open API; collection comprises the interface request, JSON parsing, and cursor update, and finally returns the data list to the master control thread. For de-duplication, the invention uses a binary vector together with a series of random mapping functions. De-duplication is applied separately to the crawl seed ID list, the user ID list, the relationship list, and the session IDs. The seed list, the crawled user list, and the social list are de-duplicated by their unique IDs, whereas a relationship is represented by combining the IDs of two users in a fixed order: the former is the followed user and the latter is the former's follower. Corresponding operations are added in several modules: when seed IDs are extracted, the multithreading module places a mutex on database operations; each thread is assigned a crawl task, for example thread 1 acquires text and thread 2 acquires users' personal information; and each thread's token pool is given a different permutation of tokens. A separate breakpoint file is also set up for each thread to record the position reached in the crawl. The database module manages database connection, closing, query, insertion, and deletion in a unified way. The IDs of the objects to crawl are first entered into a file manually, and a priority file is loaded before each crawl task starts. When tasks are divided among crawl objects, each thread is given its own crawl task, consisting of one or more processing targets chosen from user information acquisition, text acquisition, and relationship acquisition. For rate control, the system offers two adjustment modes: one controls the number of threads, and the other adjusts the amount of data obtained per API request.
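A binary vector combined with a series of random mapping functions matches the usual description of a Bloom filter, so the de-duplication step might be sketched as below. Reading the passage as a Bloom filter is an assumption, and the class name, the sizes, the salted SHA-256 hashing, and the `relation_key` helper are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit vector plus k hash functions (assumed reading of the de-duplication step).

    Membership tests can give false positives but never false negatives.
    """

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash with the function index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def relation_key(followed_id: str, follower_id: str) -> str:
    # Order matters: the first ID is the followed user, the second is the follower.
    return f"{followed_id}->{follower_id}"


seen = BloomFilter()
seen.add(relation_key("1001", "2002"))
already_crawled = relation_key("1001", "2002") in seen
```

A crawler built this way would skip any ID or relation for which the membership test returns true, accepting a small false-positive rate in exchange for constant memory.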
A user's personal attributes reflect the user's characteristics, and these characteristics provide precisely the strong features needed to identify a group. The invention first describes the group to be identified manually and extracts a keyword list, namely the group feature words, from the group's attributes. Next, the user information filtering module examines the detected users to find the user nodes that belong to the group. During filtering, personal attributes are matched against the group feature words by regular-expression string matching; if text data such as a user's personal attributes or user name contains these feature words, the user is assigned to the group to be identified.
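The attribute filter can be illustrated with a short sketch. The profile fields, the sample feature words, and both function names are hypothetical; the source only states that regular-expression string matching is applied to personal attributes and user names.

```python
import re


def build_feature_pattern(feature_words):
    """Compile one pattern that matches any group feature word literally."""
    if not feature_words:
        raise ValueError("at least one feature word is required")
    # re.escape keeps punctuation in a feature word from acting as regex syntax.
    return re.compile("|".join(re.escape(w) for w in feature_words), re.IGNORECASE)


def matches_group(profile: dict, pattern) -> bool:
    # A user qualifies if any textual attribute contains a feature word.
    return any(pattern.search(str(value)) for value in profile.values())


pattern = build_feature_pattern(["marathon", "trail running"])
users = [
    {"user_id": "u1", "name": "Marathon_Mike", "bio": "coffee and code"},
    {"user_id": "u2", "name": "Ann", "bio": "gardening"},
]
group = [u["user_id"] for u in users if matches_group(u, pattern)]
```

Here only `u1` is assigned to the group, because its user name contains one of the feature words.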
The user behavior filtering module processes the text data that users produce of their own volition in the social network, and calculates the similarity between a user and the group as follows.
First, an N-dimensional vector U based on the group feature words is established, expressed as:
$$U = [T_1, T_2, T_3, \ldots, T_N]$$
where each T is the frequency with which a given feature word occurs in the group, and the subscript, running up to N, indexes the feature words.
Second, word segmentation is applied to the full text $P_A$ of user A.
$$P_A = [\mathrm{key}_1, \mathrm{key}_2, \ldots, \mathrm{key}_N]$$
$$\mathrm{sim}(A, U) = \frac{P_A \cdot U}{\|P_A\| \, \|U\|}$$
Here each key value is the frequency with which the corresponding feature word occurs in the user's session text. The behavioral features of the user's text data and of the group are compared for closeness; if the similarity sim(A, U) exceeds a predetermined threshold, user node A is assigned to group U. After the node joins the group, the group feature words change dynamically with the text data produced by the users in the group, so that potential feature words of the current group are identified.
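The similarity sim(A, U) is the cosine of the angle between the user's feature-word frequency vector and the group's. A minimal sketch follows; whitespace splitting stands in for real word segmentation, and the feature words, the group vector, and the threshold value are assumptions chosen for illustration.

```python
import math
from collections import Counter


def frequency_vector(tokens, feature_words):
    """Count how often each feature word occurs among the tokens."""
    counts = Counter(tokens)
    return [counts[w] for w in feature_words]


def cosine_similarity(p, u):
    """sim = (p . u) / (||p|| * ||u||); defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, u))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in u))
    return dot / norm if norm else 0.0


feature_words = ["marathon", "pace", "shoes"]
U = [4, 2, 2]  # hypothetical group frequencies T_1..T_N
tokens_a = "new shoes for the marathon and a marathon pace plan".split()
P_A = frequency_vector(tokens_a, feature_words)

THRESHOLD = 0.8  # hypothetical predetermined threshold
similarity = cosine_similarity(P_A, U)
in_group = similarity > THRESHOLD
```

Because the measure depends only on direction, a user who writes little but in the same proportions as the group still scores close to 1.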
In the social relationship filtering module, the invention uses the relationship attributes of the social network to identify whether an unknown node belongs to a group. A known group α is called a strong relation group if it meets the following requirement: every user node i in α has more connections to nodes inside α than to nodes outside α.
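The strong relation group condition can be checked directly on an adjacency structure. The sketch below assumes the condition counts links of each member to nodes inside versus outside the group; the undirected `neighbors` mapping and the sample graph are hypothetical.

```python
def is_strong_group(group, neighbors):
    """Return True if every member has more links inside the group than outside it."""
    members = set(group)
    for node in members:
        linked = neighbors.get(node, set())
        inside = len(linked & members)
        outside = len(linked - members)
        if inside <= outside:
            return False
    return True


# Hypothetical undirected graph: a, b, c form a triangle; a also links to x.
neighbors = {
    "a": {"b", "c", "x"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "x": {"a", "y"},
    "y": {"x"},
}
triangle_is_strong = is_strong_group({"a", "b", "c"}, neighbors)
pair_is_strong = is_strong_group({"a", "x"}, neighbors)
```

In this example the triangle qualifies, while the pair `{a, x}` does not, since `a` has one link inside the pair and two outside it.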
Strong relation groups are identified as follows. First, the session process is reconstructed and described with a data structure. Second, the users who took part in the session are linked together by their real relationships and built into a group based on a single event. Finally, the corresponding node metrics over the social network topology are used to identify the nodes in the strong relation group.
The invention analyzes how information spreads through sessions in the social network, reconstructs the evolution of an event from its forwarding relationships, and finally stores the event to a file in a tree-shaped hierarchical structure.
Each session topology may contain a remark field pointing to the parent node, from which the parent of a given node can be found. Each session also maintains a forwarding list that records all users who forwarded the message together with their comments, from which the set of child nodes of that message node can be found. On the basis of the session tree, the nodes that took part in the session are built into a relationship network using the real relationships between users, which yields the real social network. The topology of the social network is established by a common-following method that combines the API with web page parsing: a node L follows every user u that took part in the event session, so that if $u_i$ follows $u_j$, node L and $u_i$ have a common followee, namely node $u_j$. In this way it can be determined whether $u_i$ follows other nodes within the group.
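The session tree described here can be sketched as a nested structure built from parent pointers and serialized to a file-friendly format. The record layout `(session_id, user_id, parent_session_id)`, the field names, and the choice of JSON are assumptions; the source only says the event is stored in a tree-shaped hierarchical structure.

```python
import json
from collections import defaultdict


def build_session_tree(records):
    """Build a nested tree from (session_id, user_id, parent_session_id) records.

    The root is the record whose parent is None; each node lists its forwards.
    """
    children = defaultdict(list)
    users = {}
    root = None
    for session_id, user_id, parent_id in records:
        users[session_id] = user_id
        if parent_id is None:
            root = session_id
        else:
            children[parent_id].append(session_id)
    if root is None:
        raise ValueError("no root session found")

    def to_node(session_id):
        return {
            "session_id": session_id,
            "user_id": users[session_id],
            "forwards": [to_node(child) for child in children[session_id]],
        }

    return to_node(root)


records = [
    ("s1", "u1", None),   # original message
    ("s2", "u2", "s1"),   # forwards of s1
    ("s3", "u3", "s1"),
    ("s4", "u4", "s2"),   # forward of a forward
]
tree = build_session_tree(records)
serialized = json.dumps(tree, ensure_ascii=False, indent=2)
```

The serialized string can be written to a file as is, and the participants collected from the tree are the nodes that are then linked by their real following relationships.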
When group identification is performed using semantics, relationships, and user data, the semantic information of the candidate users is extracted first. On that basis, users whose semantic information matches the session title are selected as same-semantics users. Social relationship analysis is then performed on the same-semantics users, and the users ranked highest by the relationship analysis are selected as new candidate users. Candidate users are further divided into text-associated users and relation-associated users. In each iteration, relation-associated users yield text-associated users through semantic analysis; the session-title association threshold of the text-associated users is then calculated, which gives the target group.
The candidate user set is denoted us. An initial candidate user set is obtained with a search engine as follows: the group feature words are obtained and used as search engine queries; the search results are crawled to obtain link information for the users who published the text content; and, by analyzing that link information, each user's social content is crawled. These users form the initial candidate users.
The candidate user set produced in the i-th iteration is denoted $us_i$ and its candidate users are denoted $u_{ij}$. The relation between $us_i$ and $u_{ij}$ can be expressed as:
$$us_i = (u_{i1}, \ldots, u_{ij}), \quad j < N_i$$
where $N_i$ is the number of candidate users produced in the i-th iteration.
According to how they are generated and their particular attributes, candidate users are generally divided into text-associated users, relation-associated users, and group nodes.
Semantic analysis of the relevant candidate users is the first step of each model iteration. The candidate users are the relation-associated users of the previous iteration. The users' session text is analyzed, and the relevance of each user to a particular session title is compared by calculating the user's session-title association degree. Given the relational user set after the i-th model iteration, the text-associated user set of iteration i+1 is obtained by taking each element of the relational user set and, for the given semantic keywords, calculating its session-title association degree. The session-title association degree of user i equals the number of times the keywords occur in the user's texts divided by the user's total number of texts. The higher this value, the more closely user i is associated with the session title. Calculating it therefore distinguishes which users are closely associated with the session title.
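The session-title association degree is a simple ratio and can be sketched as follows. The sketch assumes that "number of times the keywords occur" means total substring occurrences across all of the user's texts; counting texts that contain a keyword would be another plausible reading. The function name and sample data are hypothetical.

```python
def title_association(texts, keywords):
    """Keyword occurrences across a user's texts divided by the number of texts."""
    if not texts:
        return 0.0
    # str.count counts non-overlapping occurrences of each keyword in each text.
    hits = sum(text.count(keyword) for text in texts for keyword in keywords)
    return hits / len(texts)


texts = [
    "marathon training today",
    "lunch",
    "marathon pace and marathon shoes",
    "rest day",
]
degree = title_association(texts, ["marathon"])
```

With three keyword occurrences over four texts the degree is 0.75; a user who never mentions the keywords scores 0.0.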
Once the text-associated user set has been obtained, it is determined which text-associated users are valid, which yields the group nodes. The number of distinct session-title association values among the text-associated users is calculated, and from it the TopN threshold for group nodes is obtained.
Suppose M text-associated users are calculated after the i-th iteration, of which MU are non-repeated. The top N users taken as group nodes are then expressed as:
The M text-associated users are sorted in descending order of session-title association degree, and the top N users after sorting are valid, that is, they are members of the group. Once these N users have been obtained, they can be added to the group node set as group nodes.
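The ranking step can be sketched as a descending sort followed by a cut at N. Here N is passed in as a hypothetical `top_n` parameter rather than derived from M and MU, and the association values are made-up sample data.

```python
def select_group_nodes(association, top_n):
    """Sort users by association degree, highest first, and keep the first top_n."""
    ranked = sorted(association.items(), key=lambda item: item[1], reverse=True)
    return [user for user, _ in ranked[:top_n]]


# Hypothetical session-title association degrees for M = 4 text-associated users.
association = {"u1": 0.75, "u2": 0.10, "u3": 1.50, "u4": 0.40}
group_nodes = select_group_nodes(association, top_n=2)
```

Python's sort is stable, so users with equal association degrees keep their original relative order.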
After the group nodes have been obtained, the search range is expanded by adding samples. Starting from the relationship network of the group nodes, deeper candidate users are identified by social relationship analysis, which comprises the following steps:
The directed network graph formed by the follower sets and followee sets of the group nodes is obtained. The common attention degree of each user in the network is calculated, that is, the number of pairs of followers in user i's follower set that follow each other. Users whose common attention degree exceeds a predefined threshold are the required relational users.
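The common attention degree can be computed by enumerating pairs of a user's followers and counting the pairs that follow each other. In this sketch the `following` mapping (user to the set of users they follow), the helper names, and the threshold are hypothetical.

```python
from itertools import combinations


def invert(following):
    """Turn a user -> followees mapping into a user -> followers mapping."""
    followers = {}
    for src, targets in following.items():
        for dst in targets:
            followers.setdefault(dst, set()).add(src)
    return followers


def common_attention(user, followers, following):
    """Count pairs of the user's followers who follow each other."""
    count = 0
    for a, b in combinations(sorted(followers.get(user, set())), 2):
        if b in following.get(a, set()) and a in following.get(b, set()):
            count += 1
    return count


# Hypothetical directed graph: a, b, c all follow i; a and b follow each other.
following = {"a": {"i", "b"}, "b": {"i", "a"}, "c": {"i"}, "i": set()}
followers = invert(following)

THRESHOLD = 0  # hypothetical predefined threshold
relational_users = [
    u for u in sorted(following)
    if common_attention(u, followers, following) > THRESHOLD
]
```

Only `i` passes the threshold here, because it is the only user with a pair of followers (`a` and `b`) who follow each other.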
After the relational users have been obtained, the model can be iterated further: the data crawler continues to crawl the social behavior of the relational users so that semantic analysis can be performed on them.
In summary, the present invention proposes a method for determining a user group based on feature analysis, which effectively improves the accuracy and timeliness of identifying social groups on the Internet.
It will be apparent to those skilled in the art that the modules or steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed over a network of multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is thus not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the invention serve only to illustrate or explain its principles and do not limit the invention. Any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the invention shall therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or equivalents of such scope and boundaries.

Claims (2)

1. A method for determining a user group based on feature analysis, characterized by comprising:
collecting user information and social content from a social networking site server, analyzing the features of users, and identifying a specific user group based on the analyzed features.
2. The method according to claim 1, characterized in that analyzing the features of users and identifying a specific user group based on the analyzed features further comprises:
first, describing the group to be identified and extracting a keyword list, namely the group feature words, according to the group's attributes; second, examining the detected users to find the user nodes that belong to the group; during user behavior filtering, matching the users' personal attributes against the group feature words by regular-expression string matching, and, if a user's personal attributes or user-name text contain these feature words, assigning the user to the group to be identified;
in user behavior filtering, processing the text data produced by users in the social network as follows to calculate the similarity between a user and the group:
first, an N-dimensional vector U based on the group feature words is established, expressed as:
$$U = [T_1, T_2, T_3, \ldots, T_N]$$
where each T is the frequency with which a given feature word occurs in the group, and the subscript, running up to N, indexes the feature words;
second, word segmentation is applied to the full text $P_A$ of user A:
$$P_A = [\mathrm{key}_1, \mathrm{key}_2, \ldots, \mathrm{key}_N],$$
where each key value is the frequency with which the corresponding feature word occurs in the user's session text;
the behavioral features of the user's text data and of the group are then compared for closeness:
$$\mathrm{sim}(A, U) = \frac{P_A \cdot U}{\|P_A\| \, \|U\|}$$
if the similarity sim(A, U) exceeds a predetermined threshold, user node A is assigned to group U;
describing the session process with a data structure; linking the users who took part in the session together by their relationships and building them into a group based on a single event; finally, using node metrics over the social network topology to identify the nodes in the strong relation group, and storing the event to a file in a tree-shaped hierarchical structure; wherein a strong relation group is defined as follows: a known group α is called a strong relation group if every user node i in α has more connections to nodes inside α than to nodes outside α.
CN201510924814.9A 2015-12-11 2015-12-11 Method for determining user group based on feature analysis Pending CN105488211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510924814.9A CN105488211A (en) 2015-12-11 2015-12-11 Method for determining user group based on feature analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510924814.9A CN105488211A (en) 2015-12-11 2015-12-11 Method for determining user group based on feature analysis

Publications (1)

Publication Number Publication Date
CN105488211A true CN105488211A (en) 2016-04-13

Family

ID=55675186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510924814.9A Pending CN105488211A (en) 2015-12-11 2015-12-11 Method for determining user group based on feature analysis

Country Status (1)

Country Link
CN (1) CN105488211A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022938A (en) * 2016-06-02 2016-10-12 北京奇艺世纪科技有限公司 Social network user association dividing method and social network user association dividing device
CN106095839A (en) * 2016-06-03 2016-11-09 北京网智天元科技股份有限公司 The extraction of specific viewing population data and processing method thereof
CN107256231A (en) * 2017-05-04 2017-10-17 腾讯科技(深圳)有限公司 A kind of Team Member's identification equipment, method and system
CN108564467A (en) * 2018-05-09 2018-09-21 平安普惠企业管理有限公司 A kind of determination method and apparatus of consumer's risk grade
CN108647301A (en) * 2018-05-09 2018-10-12 平安普惠企业管理有限公司 A kind of creation method and terminal device of customer relationship net
CN109389157A (en) * 2018-09-14 2019-02-26 阿里巴巴集团控股有限公司 A kind of user group recognition methods and device and groups of objects recognition methods and device
CN109815406A (en) * 2019-01-31 2019-05-28 腾讯科技(深圳)有限公司 A kind of data processing, information recommendation method and device
CN110046910A (en) * 2018-12-13 2019-07-23 阿里巴巴集团控股有限公司 The method and apparatus for obtaining customer group relevant to particular customer
TWI670662B (en) * 2017-11-09 2019-09-01 財團法人資訊工業策進會 Inference system for data relation, method and system for generating marketing targets
CN110197207A (en) * 2019-05-13 2019-09-03 腾讯科技(深圳)有限公司 To not sorting out the method and relevant apparatus that user group is sorted out

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218400A (en) * 2013-03-15 2013-07-24 北京工业大学 Method for dividing network community user groups based on link and text contents
CN103793460A (en) * 2013-11-22 2014-05-14 清华大学 Method and system for sensing specific community on line on basis of social network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
姜京池: "社交网络的团体感知与挖掘方法研究", 《中国优秀硕士学位论文全文数据库》 *
李蕾: "微博特定群体发现模型研究", 《中国优秀硕士学位论文全文数据库》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022938A (en) * 2016-06-02 2016-10-12 北京奇艺世纪科技有限公司 Social network user association dividing method and social network user association dividing device
CN106095839A (en) * 2016-06-03 2016-11-09 北京网智天元科技股份有限公司 The extraction of specific viewing population data and processing method thereof
CN107256231A (en) * 2017-05-04 2017-10-17 腾讯科技(深圳)有限公司 A kind of Team Member's identification equipment, method and system
CN107256231B (en) * 2017-05-04 2022-04-22 腾讯科技(深圳)有限公司 Team member identification device, method and system
TWI670662B (en) * 2017-11-09 2019-09-01 財團法人資訊工業策進會 Inference system for data relation, method and system for generating marketing targets
CN108564467A (en) * 2018-05-09 2018-09-21 平安普惠企业管理有限公司 A kind of determination method and apparatus of consumer's risk grade
CN108647301A (en) * 2018-05-09 2018-10-12 平安普惠企业管理有限公司 A kind of creation method and terminal device of customer relationship net
CN109389157A (en) * 2018-09-14 2019-02-26 阿里巴巴集团控股有限公司 A kind of user group recognition methods and device and groups of objects recognition methods and device
CN110046910A (en) * 2018-12-13 2019-07-23 阿里巴巴集团控股有限公司 The method and apparatus for obtaining customer group relevant to particular customer
CN109815406A (en) * 2019-01-31 2019-05-28 腾讯科技(深圳)有限公司 A kind of data processing, information recommendation method and device
CN109815406B (en) * 2019-01-31 2022-12-13 腾讯科技(深圳)有限公司 Data processing and information recommendation method and device
CN110197207A (en) * 2019-05-13 2019-09-03 腾讯科技(深圳)有限公司 To not sorting out the method and relevant apparatus that user group is sorted out

Similar Documents

Publication Publication Date Title
CN105488211A (en) Method for determining user group based on feature analysis
CN110457404B (en) Social media account classification method based on complex heterogeneous network
CN105512301A (en) User grouping method based on social content
Zhang et al. Online social network profile linkage
Bartunov et al. Joint link-attribute user identity resolution in online social networks
Wu et al. Adaptive spammer detection with sparse group modeling
Interdonato et al. Multilayer network simplification: approaches, models and methods
CN111125460B (en) Information recommendation method and device
Shi et al. Event detection and identification of influential spreaders in social media data streams
CN107633444B (en) Recommendation system noise filtering method based on information entropy and fuzzy C-means clustering
CN112488716B (en) Abnormal event detection system
CN106980651B (en) Crawling seed list updating method and device based on knowledge graph
CN107341199B (en) Recommendation method based on document information commonality mode
De Boom et al. Semantics-driven event clustering in Twitter feeds
Sriramoju Review on Big Data and Mining Algorithm
CN113422761A (en) Malicious social user detection method based on counterstudy
CN108536866B (en) Microblog hidden key user analysis method based on topic transfer entropy
CN110688549A (en) Artificial intelligence classification method and system based on knowledge system map construction
CN105589935A (en) Social group recognition method
Hu et al. Co-clustering enterprise social networks
Gu et al. Ideology detection for twitter users with heterogeneous types of links
Li et al. Keyword-based correlated network computation over large social media
Thota et al. Early rumor detection in social media based on graph convolutional networks
CN110543601B (en) Method and system for recommending context-aware interest points based on intelligent set
CN112380455A (en) Method for directionally and covertly acquiring data of international and foreign internet based on backtracking security controlled network access channel

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160413
