CN107862620A

CN107862620A - A similar-user mining method based on social data

Info

Publication number: CN107862620A
Application number: CN201711311721.4A
Authority: CN
Inventors: 李开宇; 王月超
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2017-12-11
Filing date: 2017-12-11
Publication date: 2018-03-30

Abstract

The present invention provides a similar-user mining method based on social data, relating to the field of social networks in Internet information technology. The method comprises: Step 1: crawling key data from users' microblog text; Step 2: extracting TopN keywords from the key data; Step 3: training a general word2vec model for use with the TopN keywords; Step 4: computing user interest vectors; Step 5: computing the pairwise cosine similarity of the user interest vectors to obtain the interest similarity between two users; Step 6: filtering out similar users according to the interest similarity, completing similar-user mining. The invention addresses two problems of existing approaches. First, treating the other users @-mentioned in a user's posts, comments, and reposts as similar users relies on a strong but very sparse correlation, which hinders large-scale expansion to new similar users and results in low recall. Second, judging similarity by how users follow big-V users still suffers from data sparsity when computing ordinary interest similarity, so the final recall remains low.

Description

A similar-user mining method based on social data

Technical field

The present invention relates to the field of social networks in Internet information technology, and more particularly to a similar-user mining method based on social data.

Background

As a new kind of social networking service, microblogging has attracted the attention of more and more users. According to statistics, thousands of new users join microblog platforms every day, generating hundreds of millions of microblog posts. At the same time, more and more merchants have opened enterprise accounts to attract followers (their customer base). Targeted marketing to these customers to build brand awareness has become a mainstream means of brand promotion and marketing. How to expand from existing customers to new, related users is a question every enterprise keeps considering, and user expansion based on social big data is gaining increasing recognition. Finding users with similar interests is an important means of user expansion. There are two existing methods for finding similar users; their steps and drawbacks are explained below.

In the first method, the other users @-mentioned in a user's posts, comments, and reposts are taken as that user's similar users. This method is very simple: crawling and parsing the user's microblog directly reveals these relationships. Such users share interests with the user; this follows from a feature specific to microblogging and is a strong correlation. However, this strong correlation is very sparse, which hinders large-scale expansion to new similar users and results in low recall.

In the second method, the similarity between two users is judged by how similarly they follow big-V users. The steps are: (1) crawl the user's following list, and select from it all big-V users with media attributes according to each account's follower count or self-introduction; (2) collect the number of interactions (comments, reposts, and likes) between the user and each big-V user, together with each big-V user's follower count; (3) compute the user's interest index for each big-V user by the formula "interest index = number of interactions / number of big-V followers", and form the vector of the user's interest indices over the big-V users; (4) compute the cosine similarity between two users' interest-index vectors to represent the interest similarity between the two users; (5) for a candidate user, compute the number of users in the set to be expanded that it is similar to, and the average similarity, to decide whether the candidate is a user to expand to.

This method handles similarity for popular interests fairly well, but for ordinary interest similarity the data is still sparse and the mining of similar users hits a bottleneck. For example, take two niche big-V accounts or brands, blogger A and blogger B, that belong to the same field but whose followers barely overlap. User a follows blogger A and user b follows blogger B; a and b are similar users, yet the second method cannot detect this. The final recall of this mining method is therefore also low.
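The sparsity problem of the second method can be illustrated with a minimal Python sketch. The account names, follower counts, and interaction counts are hypothetical, chosen only to mirror the blogger A / blogger B example; the interest index follows the quoted formula.

```python
from typing import Dict, List


def interest_index(interactions: int, followers: int) -> float:
    """Interest index = number of interactions / number of big-V followers."""
    return interactions / followers if followers > 0 else 0.0


def index_vector(user_interactions: Dict[str, int],
                 followers: Dict[str, int],
                 big_vs: List[str]) -> List[float]:
    """One interest index per big-V account, in a fixed account order."""
    return [interest_index(user_interactions.get(v, 0), followers[v])
            for v in big_vs]


# Hypothetical data: two niche bloggers in the same field, no shared followers.
followers = {"blogger_A": 200_000, "blogger_B": 150_000}
big_vs = sorted(followers)

vec_a = index_vector({"blogger_A": 40}, followers, big_vs)  # user a
vec_b = index_vector({"blogger_B": 30}, followers, big_vs)  # user b

# The dot product is the numerator of the cosine similarity.
dot = sum(x * y for x, y in zip(vec_a, vec_b))
```

Because the two vectors have no non-zero position in common, the dot product, and hence the cosine similarity, is zero even though both users follow bloggers in the same field.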

Summary of the invention

The object of the invention is as follows. The existing method based on the other users @-mentioned in a user's posts, comments, and reposts relies on a strong correlation that is very sparse, which hinders large-scale expansion to new similar users and results in low recall. The existing method based on how users follow big-V users still suffers from data sparsity when computing ordinary interest similarity, so its final recall remains low. To solve these problems, the present invention provides a similar-user mining method based on social data.

The technical solution is as follows:

A similar-user mining method based on social data comprises the following steps:

Step 1: crawl key data from the user's microblog text;

Step 2: extract TopN keywords from the key data;

Step 3: train a general word2vec model on a general corpus and, from the TopN keywords of the two users, compute the users' interest vectors;

Step 4: compute the pairwise cosine similarity of the user interest vectors to obtain the interest similarity between two users;

Step 5: filter out similar users according to the interest similarity, completing similar-user mining.

Specifically, in Step 1, the key data crawled from the user's microblog text includes: (1) the text of the microblogs the user has posted and interacted with; (2) the list of big-V bloggers the user follows; (3) the text of the microblogs posted by the big-V bloggers the user follows.

Preferably, a big-V blogger is one with at least 100,000 (10W) microblog followers, and the key data is the user's data from the last three months.

Specifically, in Step 2, before the TopN keywords are extracted from the key data, the text data crawled from the user's microblog undergoes NLP processing, which includes word segmentation and stop-word removal.
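The NLP preprocessing can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the `preprocess` helper, the tiny stop-word set, and the sample sentence are hypothetical, and `str.split` on pre-spaced text stands in for a real Chinese word segmenter (an open-source segmenter such as jieba could be passed in as `segment` instead).

```python
from typing import Callable, Iterable, List, Set


def preprocess(text: str,
               segment: Callable[[str], Iterable[str]],
               stop_words: Set[str]) -> List[str]:
    """Segment the text into words, then drop empty tokens and stop words."""
    tokens = (token.strip() for token in segment(text))
    return [token for token in tokens if token and token not in stop_words]


# Hypothetical stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"的", "了", "是"}

# str.split is a placeholder segmenter that only works on pre-spaced text.
tokens = preprocess("今天 的 火锅 是 真 好吃", str.split, STOP_WORDS)
```

Keeping the segmenter as a parameter lets the same filtering step be reused for both the crawled microblog text and the general corpus of Step 3.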

Specifically, in Step 2, the TopN keywords are computed with the TextRank ranking method, as follows. The set of microblogs posted by all users is treated as one document D, where each microblog posted by user u is m_i, with m_i ∈ D. The term frequency TF_k of each candidate keyword k represents how often the word occurs in the microblog document m_i, and the inverse document frequency IDF_k of keyword k over the whole microblog document D is:

$$\mathrm{IDF}_k = \log\left(\frac{|du|}{1 + |\{m : k \in m\}|}\right) \tag{1}$$

where |{m : k ∈ m}| is the number of microblogs containing keyword k, and |du| is the total number of microblogs posted by the user. The importance score of word V_k under the TextRank ranking method is:

$$S(V_k) = (1 - d) \times \mathrm{TF}_k \times \mathrm{IDF}_k + d \times \sum_{V_j \in E(V_k)} \frac{w_{jk}}{\sum_{V_i \in E(V_j)} w_{ji}} \, S(V_j) \tag{2}$$

where S(V_k) is the importance of word k; d is the damping coefficient; TF_k is the frequency with which word k occurs in the document; E(V_k) is the set of words that co-occur with word k, co-occurrence meaning that the words appear in the same sentence; w_jk is the similarity between word j and word k, and w_ji is the similarity between word j and word i; word j is a word similar to word k, and word i is a word similar to word j; the similarity equals the number of co-occurrences of word i and word j divided by the sum of the numbers of times each occurs. The formula is then iterated to obtain the importance score of each word; the words are sorted by importance score, and the words with the larger scores are kept as the extracted TopN keywords.
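The term weight TF_k × IDF_k used in the score can be sketched as follows. The text leaves some details open, so this sketch makes two assumptions that are not in the source: TF_k is taken as the relative frequency of the word over all of the user's tokens, and the logarithm is the natural logarithm. The sample posts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf(keyword: str, posts: List[List[str]]) -> float:
    """IDF_k = log(|du| / (1 + number of posts containing the keyword))."""
    containing = sum(1 for post in posts if keyword in post)
    return math.log(len(posts) / (1 + containing))


def term_weights(posts: List[List[str]]) -> Dict[str, float]:
    """TF_k * IDF_k per word; TF_k is a relative frequency (assumption)."""
    all_tokens = [token for post in posts for token in post]
    counts = Counter(all_tokens)
    total = len(all_tokens)
    return {word: (count / total) * idf(word, posts)
            for word, count in counts.items()}


# Hypothetical, already segmented posts of one user.
posts = [
    ["hotpot", "chengdu"],
    ["hotpot", "spicy"],
    ["travel", "chengdu"],
    ["music"],
]
weights = term_weights(posts)
```

Note that with the "1 +" in the denominator, a word that appears in every post receives a slightly negative IDF, so very common words are pushed down the ranking.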

The specific steps of Step 3 include: (1) download a Chinese corpus; (2) perform word segmentation and stop-word removal on the Chinese corpus; (3) train a word2vec model using an open-source tool; (4) convert the TopN keywords obtained in Step 2 into word vectors, one per keyword; (5) sum all the word vectors to obtain the user's interest vector.
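Sub-steps (4) and (5) can be sketched as below. The three-dimensional `WORD_VECTORS` table is a hypothetical stand-in for the word-to-vector mapping of a trained word2vec model, and skipping keywords that are missing from the model is an assumption; the text does not say how such words are handled.

```python
from typing import Dict, List, Sequence


def interest_vector(keywords: Sequence[str],
                    word_vectors: Dict[str, Sequence[float]],
                    dim: int) -> List[float]:
    """Sum the word vectors of a user's TopN keywords."""
    total = [0.0] * dim
    for word in keywords:
        vector = word_vectors.get(word)
        if vector is None:
            continue  # Out-of-vocabulary keyword: skipped (assumption).
        for i, value in enumerate(vector):
            total[i] += value
    return total


# Hypothetical toy embeddings; a real model would have far more dimensions.
WORD_VECTORS = {
    "hotpot": [0.9, 0.1, 0.0],
    "spicy": [0.8, 0.2, 0.1],
    "travel": [0.0, 0.7, 0.6],
}

user_vec = interest_vector(["hotpot", "spicy", "unknown"], WORD_VECTORS, dim=3)
```

Since cosine similarity ignores vector length, summing and averaging the keyword vectors give the same similarity in Step 4.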

With the above scheme, the beneficial effects of the invention are as follows:

(1) The invention departs from the traditional approaches that rely only on the other users @-mentioned in a user's posts, comments, and reposts, or only on how users follow big-V users, and instead mines similar users from the user's comprehensive social data. A big-V blogger can be regarded as a bridge that brings in more interactive microblog data. Even when two users do not follow the same big-V blogger, if the microblog content they interact with is similar (for example, both follow entertainment-gossip bloggers, food bloggers, or fashion bloggers), they can still be computed as similar users, which is why recall expands. Compared with the mining method that considers only how users follow big-V users, the method of the invention raises customer recall from 37.2% to 66.72%, roughly 1.8 times the baseline, a highly significant effect.

Because the scale of mined similar users is larger and more information is crawled, the amount of computation grows and a slowdown is unavoidable. This can be controlled by filtering users: for example, where similar users were originally mined from 10,000 users, 1,000 representative users are now selected from those 10,000.

(2) The traditional TextRank method extracts keywords by taking words as candidate nodes to build an undirected graph and computing the score of each noun node in the graph, without considering the effect of each word's own weight on the node score. If word importance is ignored, unimportant words keep propagating to more unimportant words, which seriously reduces the accuracy of the subsequent similarity computation. Therefore, this method collects word statistics from the documents and, on top of the word co-occurrence counts, adds the term weight TF_k × IDF_k to the score computation, selecting the nouns with larger scores as candidate keywords. This safeguards accuracy.

Embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.

The similar-user mining method based on social data in this embodiment comprises the following steps:

Step 1: crawl key data from the user's microblog text. The key data crawled from the user's microblog text includes: (1) the text of the microblogs the user has posted and interacted with; (2) the list of big-V bloggers the user follows; (3) the text of the microblogs posted by the big-V bloggers the user follows. A big-V blogger is one with at least 100,000 (10W) microblog followers; if the mining demand is large, 50,000 (5W) can be used as the mining threshold. The key data is the user's data from the last three months.
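The two selection rules of Step 1 (the follower threshold for big-V bloggers and the three-month window) can be sketched as a filter over already crawled records. The record fields `name`, `followers`, and `time` are hypothetical, and approximating "the last three months" by 90 days is an assumption.

```python
from datetime import datetime, timedelta
from typing import Dict, List

FOLLOWER_THRESHOLD = 100_000       # 10W; 50_000 (5W) when mining demand is large
RECENT_WINDOW = timedelta(days=90)  # "last three months", taken as 90 days


def select_big_v(following: List[Dict], threshold: int = FOLLOWER_THRESHOLD) -> List[str]:
    """Names of followed accounts whose follower count reaches the threshold."""
    return [acc["name"] for acc in following if acc["followers"] >= threshold]


def recent_posts(posts: List[Dict], now: datetime,
                 window: timedelta = RECENT_WINDOW) -> List[Dict]:
    """Posts published within the window ending at `now`."""
    cutoff = now - window
    return [post for post in posts if post["time"] >= cutoff]


# Hypothetical crawled records.
following = [
    {"name": "food_blogger", "followers": 120_000},
    {"name": "friend", "followers": 300},
    {"name": "style_blogger", "followers": 60_000},
]
posts = [
    {"text": "new hotpot place", "time": datetime(2017, 11, 1)},
    {"text": "summer trip", "time": datetime(2017, 6, 1)},
]

big_v_default = select_big_v(following)
big_v_relaxed = select_big_v(following, threshold=50_000)
kept = recent_posts(posts, now=datetime(2017, 12, 11))
```

Lowering the threshold from 10W to 5W admits more bloggers and therefore more text, at the cost of more crawling and computation.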

Step 2: extract TopN keywords from the key data. Before the TopN keywords are extracted, the text data crawled from the user's microblog undergoes NLP processing, which includes word segmentation and stop-word removal. The TopN keywords are then computed with the TextRank ranking method, as follows. The set of microblogs posted by all users, together with the set of microblogs posted by the big-V bloggers they follow, is treated as one document D, where each microblog posted by user u is m_i, with m_i ∈ D. The term frequency TF_k of each candidate keyword k represents how often the word occurs in the microblog document m_i, and the inverse document frequency IDF_k of keyword k over the whole microblog document D is:

$$\mathrm{IDF}_k = \log\left(\frac{|du|}{1 + |\{m : k \in m\}|}\right) \tag{1}$$

where |{m : k ∈ m}| is the number of microblogs containing keyword k, and |du| is the total number of microblogs posted by the user. The importance score of a word under the TextRank ranking method is:

$$S(V_k) = (1 - d) \times \mathrm{TF}_k \times \mathrm{IDF}_k + d \times \sum_{V_j \in E(V_k)} \frac{w_{jk}}{\sum_{V_i \in E(V_j)} w_{ji}} \, S(V_j) \tag{2}$$

The traditional TextRank method extracts keywords by taking words as candidate nodes to build an undirected graph and computing the score of each noun node in the graph. In the same notation, the standard formula of the traditional method is:

$$S(V_k) = (1 - d) + d \times \sum_{V_j \in E(V_k)} \frac{w_{jk}}{\sum_{V_i \in E(V_j)} w_{ji}} \, S(V_j)$$

In these formulas, d is the damping coefficient, generally set to 0.85; S(V_k) is the importance of word k; TF_k is the frequency with which word V_k occurs in the document; E(V_k) is the set of words that co-occur with word V_k, co-occurrence meaning that the words appear in the same sentence; w_jk is the similarity between word j and word k, and w_ji is the similarity between word j and word i; the similarity equals the number of co-occurrences of word j and word k divided by the sum of the numbers of times each occurs. As the traditional formula shows, the traditional TextRank method only uses the co-occurrence counts of words as edge weights and does not consider the effect of each word's own weight on the node score. Therefore, this method collects word statistics from the documents and, on top of the word co-occurrence counts, adds the term weight TF_k × IDF_k to the score computation, selecting the nouns with larger scores as candidate keywords.
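The iteration of the weighted score can be sketched as follows. This is a minimal illustration, not the patent's code: the term weights and edge similarities are hypothetical inputs (in practice they would come from TF_k × IDF_k and from the co-occurrence counts), and the fixed number of iterations and the initial score of 1.0 are assumptions.

```python
from typing import Dict, List, Tuple


def textrank_scores(term_weight: Dict[str, float],
                    edges: Dict[Tuple[str, str], float],
                    d: float = 0.85,
                    iterations: int = 100) -> Dict[str, float]:
    """Iterate S(V_k) = (1 - d) * TF_k * IDF_k + d * sum_j (w_jk / sum_i w_ji) * S(V_j)."""
    # Undirected graph: store each edge weight in both directions.
    neighbors: Dict[str, Dict[str, float]] = {word: {} for word in term_weight}
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight
    # Denominator of the formula: total edge weight around each word j.
    weight_sum = {word: sum(nb.values()) for word, nb in neighbors.items()}

    scores = {word: 1.0 for word in term_weight}
    for _ in range(iterations):
        updated = {}
        for k in term_weight:
            spread = sum(w_jk / weight_sum[j] * scores[j]
                         for j, w_jk in neighbors[k].items()
                         if weight_sum[j] > 0)
            updated[k] = (1 - d) * term_weight[k] + d * spread
        scores = updated
    return scores


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """The n words with the largest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical term weights (TF_k * IDF_k) and co-occurrence similarities.
term_weight = {"hotpot": 0.5, "spicy": 0.3, "weather": 0.05}
edges = {("hotpot", "spicy"): 0.5, ("spicy", "weather"): 0.1}

scores = textrank_scores(term_weight, edges)
keywords = top_n(scores, 2)
```

Replacing `term_weight[k]` with the constant 1 recovers the traditional formula, in which the low-weight word "weather" would receive the same base score as the other two words.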

Step 3: train a general word2vec model on a general corpus, and use the TopN keywords of two users to compute the similarity between them. The word2vec model is a neural-network word-embedding model that reduces text processing to vector operations in a K-dimensional vector space, where vector operations can express the similarity between words. The sub-steps are: (1) download a Chinese corpus; (2) perform word segmentation and stop-word removal on the Chinese corpus; (3) train a word2vec model using an open-source tool. Training the word2vec model with an open-source tool amounts to running the program with the corpus as input, which yields a one-to-one mapping between words and vectors (that is, the model). The TopN keywords can then be mapped to vectors, and summing the keyword vectors gives the user interest vector.

Step 4: compute the pairwise cosine similarity of the user interest vectors to obtain the interest similarity between two users. Cosine similarity evaluates the similarity of two vectors by computing the cosine of the angle between them. The vectors are plotted in a vector space according to their coordinates, for example in the most common two-dimensional space.

The angle between them is then found and the corresponding cosine value obtained; this cosine value characterizes the similarity of the two vectors. The smaller the angle, the closer the cosine is to 1, the more closely the directions of the vectors coincide, and the more similar they are.
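The cosine of the angle between two vectors A and B is the standard quantity cos θ = (A · B) / (|A| |B|). A minimal sketch follows; returning 0.0 for a zero-length vector is an assumption made here so that users without any known keywords do not cause a division by zero.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Zero vector: treated as "no similarity" (assumption).
    return dot / (norm_a * norm_b)


# Two-dimensional examples: same direction, 45 degrees apart, perpendicular.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
forty_five_degrees = cosine_similarity([1.0, 0.0], [1.0, 1.0])
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The three examples correspond to angles of 0, 45, and 90 degrees, so the similarity falls as the angle widens.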

Step 5: filter out similar users according to the interest similarity, completing similar-user mining.
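The text does not state how the filtering is done; one plausible reading is a similarity threshold. The sketch below assumes that reading. The threshold value 0.8, the user names, and the similarity scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def filter_similar(similarities: Dict[Tuple[str, str], float],
                   threshold: float) -> Dict[str, List[str]]:
    """For each user, the users whose similarity reaches the threshold,
    ordered from most to least similar."""
    matches: Dict[str, List[Tuple[float, str]]] = {}
    for (u, v), sim in similarities.items():
        if sim >= threshold:
            matches.setdefault(u, []).append((sim, v))
            matches.setdefault(v, []).append((sim, u))
    return {user: [name for _, name in sorted(pairs, reverse=True)]
            for user, pairs in matches.items()}


# Hypothetical pairwise interest similarities from the previous step.
similarities = {
    ("user_a", "user_b"): 0.92,
    ("user_a", "user_c"): 0.40,
    ("user_b", "user_c"): 0.81,
}

similar_users = filter_similar(similarities, threshold=0.8)
```

A lower threshold trades precision for recall, so in practice the value would be tuned against the recall target.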

It will be apparent to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced in the invention.

Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted only for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A similar-user mining method based on social data, characterized by comprising the following steps:

Step 1: crawl key data from the user's microblog text;

Step 2: extract TopN keywords from the key data;

Step 3: train a general word2vec model on a general corpus, and then compute the users' interest vectors from the TopN keywords of the two users;

2. The similar-user mining method based on social data according to claim 1, characterized in that, in Step 1, the key data crawled from the user's microblog text includes:

(1) the text of the microblogs the user has posted and interacted with;

(2) the list of big-V bloggers the user follows;

(3) the text of the microblogs posted by the big-V bloggers the user follows.

3. The similar-user mining method based on social data according to claim 2, characterized in that a big-V blogger is one with at least 100,000 (10W) microblog followers, and the key data is the user's data from the last three months.

4. The similar-user mining method based on social data according to claim 1, characterized in that, in Step 2, before the TopN keywords are extracted from the key data, the text data crawled from the user's microblog undergoes NLP processing, which includes word segmentation and stop-word removal.

5. The similar-user mining method based on social data according to claim 1 or 4, characterized in that, in Step 2, the TopN keywords are computed with the TextRank ranking method, as follows: the set of microblogs posted by all users, together with the text of the microblogs posted by the big-V bloggers the users follow, is treated as one document D, where each microblog posted by user u is m_i, with m_i ∈ D; the term frequency TF_k of each candidate keyword k represents how often the word occurs in the microblog document m_i; and the inverse document frequency IDF_k of keyword k over the whole microblog document D is:

$$\mathrm{IDF}_k = \log\left(\frac{|du|}{1 + |\{m : k \in m\}|}\right) \tag{1}$$

where |{m : k ∈ m}| is the number of microblogs containing keyword k, and |du| is the total number of microblogs posted by the user; the importance score of word k under the TextRank ranking method is:

$$S(V_k) = (1 - d) \times \mathrm{TF}_k \times \mathrm{IDF}_k + d \times \sum_{V_j \in E(V_k)} \frac{w_{jk}}{\sum_{V_i \in E(V_j)} w_{ji}} \, S(V_j) \tag{2}$$

where S(V_k) is the importance of word k; TF_k is the frequency with which word k occurs in the document; E(V_k) is the set of words that co-occur with word k, co-occurrence meaning that the words appear in the same sentence; w_jk is the similarity between word j and word k, and w_ji is the similarity between word j and word i; word j is a word similar to word k, and word i is a word similar to word j; the similarity equals the number of co-occurrences of word i and word j divided by the sum of the numbers of times each occurs; the formula is then iterated to obtain the importance score of each word, the words are sorted by importance score, and the words with the larger scores are kept as the extracted TopN keywords.

6. The similar-user mining method based on social data according to claim 1, characterized in that the specific steps of Step 3 include: (1) download a Chinese corpus; (2) perform word segmentation and stop-word removal on the Chinese corpus; (3) train a word2vec model using an open-source tool; (4) convert the TopN keywords obtained in Step 2 into word vectors, one per keyword; (5) sum all the word vectors to obtain the user's interest vector.