CN105653518A

CN105653518A - Specific group discovery and expansion method based on microblog data

Info

Publication number: CN105653518A
Application number: CN201510997788.2A
Authority: CN
Inventors: 吴松泽; 张华平; 徐程程; 王洋; 王�琦; 李高超; 付戈
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2015-12-25
Filing date: 2015-12-25
Publication date: 2016-06-08

Abstract

The invention relates to a specific group discovery and expansion method based on microblog data, and belongs to the field of social network analysis and data mining. The method comprises the following steps: collecting relevant group information; integrating and mapping the information; extracting features from the text data; calculating user similarity; performing group self-detection by category; and extracting the attributes of the specific group, judging its category, and expanding the group. The method avoids the problem that groups cannot be identified with a network model when the data is sparse or incomplete, can be applied to large-scale data computation, and is highly stable.

Description

A specific group discovery and expansion method based on microblog data

Technical field

This method relates to the discovery and expansion of specific text-based groups in social networks, in particular the discovery and expansion of specific groups from microblog data, and belongs to the field of social network analysis and data mining.

Background art

In a social network, users can publish their own information and view information shared by others, and in this way build a virtual-era social network. Such a sharing platform is timely, real-time, and interactive, while retaining the propagation characteristics of traditional social interaction, and it has become part of people's work and life.

On a microblog platform, users produce a large amount of text data, and mining this data can yield highly valuable information. Efficient data mining methods and machine learning algorithms are therefore needed to extract the valuable information contained in social network text. One such kind of valuable information is the discovery and expansion of social media groups.

Feature extraction of text data means that, when the full text data set is classified or clustered, features must be extracted from the data: key words with larger weights are selected as the features of a text, which facilitates similarity calculation and classification or clustering. The techniques involved include word segmentation, word frequency statistics, and keyword extraction. The weight of each word is calculated from its word frequency or TF-IDF. The main feature extraction algorithms are: the information gain algorithm, which computes whether the gain in information entropy reaches its maximum after different features are extracted, and keeps the feature vector that yields the maximum information entropy; the mutual information algorithm, which takes the category information of the text into account, computes the mutual information between category and feature, and retains the features with large mutual information; CHI (the chi-square test), which tests the hypothesis that a word and a category are independent, where a higher chi-square value means weaker independence between the word and the category and therefore a more distinctive feature, so the word is extracted as a feature; and cross entropy (KL distance), which reflects the distance between the probability distribution of text categories and the probability distribution of categories given the occurrence of a specific term. This distance measures the influence of the term on classification: the larger the KL distance, the greater the term's influence on the category decision.

In the field of social media data mining, there are many mature community discovery algorithms. A community is a group with high closeness and similarity. Community discovery algorithms use the relational information between users to judge users' influence and homogeneity, and convert these into the features on which community detection is based. Current mainstream algorithms for community partitioning include modularity-based, spectral-analysis-based, information-theory-based, and label-propagation-based community discovery algorithms. Among these, the one that can be applied at large scale in practical engineering is the modularity-based community discovery algorithm proposed by Newman et al. This algorithm defines a discriminant value, the community modularity Q. In each iteration it considers adding a new element (node), computes whether the value of Q gains, and finally reaches a stable result. Under this framework, the social media must first be modeled as a network, and communities are then discovered and partitioned according to the network structure and its associations. For social media such as microblogs, network modeling mainly uses the follower and following relationships and the repost relationships of microblog users to build the edges of a graph model, and modularity is then computed on the constructed network model.
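For reference, the modularity Q mentioned above is commonly written as follows. The notation is an assumption made here and is not taken from this document: A_{uv} is the adjacency matrix entry for nodes u and v, k_u is the degree of node u, |E| is the number of edges, c_u is the community of node u, and δ equals 1 when its two arguments are equal and 0 otherwise.

$$Q = \frac{1}{2|E|} \sum_{u,v} \left[ A_{uv} - \frac{k_u k_v}{2|E|} \right] \delta(c_u, c_v)$$

Every term depends on the adjacency matrix and the node degrees, so Q can only be evaluated reliably when the relationship network itself is reasonably complete.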

In practice, however, because data is sparse, a complete data set is hard to obtain, which poses a major challenge for traditional computational methods. Often the full data, such as follower relationships, following relationships, and repost information, cannot be obtained. In that case, because the data is incomplete, a complete relationship network cannot be built, and a traditional algorithm centered on modularity cannot accurately compute the closeness and modularity of each group. For sparse relationship networks, a more ingenious computational model is therefore needed for community discovery.

Summary of the invention

Considering that, in massive data, the information that is easiest to obtain and most complete is the text published by social network users, we propose a group discovery and expansion method based on text data. The method applies natural language processing to each user's text data to extract that user's feature information, builds a model from the feature information, performs cluster analysis by comparing the similarity between users to obtain the groups, and extracts the salient features of each group for group expansion. This solves the problem that groups cannot be discovered accurately when user relationship link data is sparse.

The technical solution proposed by the present invention comprises the following steps:

Step 1, collect relevant group information: using crawler technology or data resources made public by the microblog platform, obtain the group information to be analyzed. This information mainly includes: the text of microblog posts published by users; the text of comments made by users; the interactions users carry out on the microblog platform, including comment operations, repost relationships, and like operations; and users' basic attributes, including follower count, following count, and following relationships;

Step 2, integrate and map the group information: for the sample data obtained in Step 1, first remove markup tags and parse the data according to its hierarchical structure to obtain the user-microblog text mapping and the user-comment text mapping, and retain the user-following, user-follower, and user-repost relationships;

Step 3, extract features from the text data: for the microblog content published by each user, use relative entropy (that is, KL distance) for feature extraction, obtain each user's feature words, and establish the corresponding mapping;

Step 4, calculate user similarity: based on the text features extracted in Step 3, use cosine similarity to calculate the similarity between users, and cluster or classify the users according to the similarity results;

Step 5, perform group self-detection by category: for the groups obtained in Step 4 from the text feature data, summarize the characteristic features of each whole group to obtain the feature data of each category. Specifically, assuming a group has N features in total, majority voting is used, and the K most commonly shared features are adopted as the representative features of the group;

Step 6, extract the attributes of the specific group, judge its category, and expand the group: the sample data of the specific group to be discovered is used as training data; feature words are obtained from this training data with the text feature extraction algorithm of Step 3, similarity to the category features is calculated according to Step 4, and a list of users related to the cluster is obtained.

Beneficial effects

This method uses the text data in microblogs, which captures user information to a considerable extent, and applies a feature extraction algorithm to obtain user features. It thereby avoids the problem that groups cannot be identified with a network model when the data is sparse or incomplete. While discovering groups, the invention also obtains the feature information of each category, which helps in understanding the data and has strong practical value. The text-based group discovery and expansion method makes full use of the data and discovers groups without building a complex network model, which reduces the complexity of the algorithm. The algorithm is also highly modular, can be applied to large-scale data computation, and has high stability.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the group discovery and expansion method based on microblog text data;

Fig. 2 is a schematic structural diagram of web crawler collection of microblog data;

Fig. 3 shows the analysis of the collected text data and the construction of a quantifiable text vector space model and mapping;

Fig. 4 shows text feature extraction using KL distance and the training process of the group discovery model;

Fig. 5 shows group discovery and group expansion using the trained model.

Detailed description of the invention

Specific embodiments of the present invention are described below with reference to the drawings:

The overall flow of the group discovery and expansion method based on microblog text data is shown in Fig. 1. Take the discovery of people related to the "machine learning" field on Sina Weibo as an example. We first built, according to Steps 1 to 5, a set of category groups and a library of representative features for each group, one of which is the "machine learning" category. We then found 50 suspected related users. The goal is to identify the users who are actually related and thereby expand the "machine learning" category. Concretely, the feature word vector of each of the 50 suspected users is obtained according to Steps 1 to 3, and category matching is then performed according to Step 6 to judge whether each user belongs to the "machine learning" category. The detailed implementation of each step of the algorithm follows.

Collecting relevant information according to Step 1:

The Sina Weibo data to be studied is either crawled or taken directly from the public data that Sina Weibo provides. As shown in Fig. 2, data collection builds a buffered URL queue and uses breadth-first search (BFS) to follow web page links. Each node page is scanned and downloaded, and the page is parsed to remove irrelevant noise and retain the metadata that describes user attributes: the text of the microblog posts the user published, the text of the user's comments, the user's follower count, the user's following count, and the user's repost relationships. Alternatively, the relevant information can be extracted directly through the API officially provided by the microblog platform or through feeds such as RSS;
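The breadth-first traversal over a buffered URL queue described above can be sketched as follows. This is a minimal illustration, not the crawler of the invention: `fetch_links` is a hypothetical callable standing in for page download and parsing, and the in-memory `toy_web` dictionary replaces real pages so the example runs without network access.

```python
from collections import deque


def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from seed_urls.

    fetch_links(url) is a hypothetical callable that downloads and parses
    one page and returns the URLs it links to.
    """
    queue = deque(seed_urls)   # buffered URL queue
    seen = set(seed_urls)      # URLs already queued, to avoid repeats
    visited = []               # pages in the order they were scanned
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy in-memory link graph standing in for real pages (assumption).
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
order = bfs_crawl(["a"], lambda url: toy_web.get(url, []))
```

The `max_pages` cap is an added safeguard for the sketch; a real crawler would also need rate limiting and error handling.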

Integrating and mapping the information according to Step 2:

After the metadata has been obtained in Step 1, it is integrated and stored, and the corresponding mappings are established. As shown in Fig. 3, the work comprises:

1) Text segmentation: the user's microblog text (published posts and comments) is segmented with the ICTCLAS word segmentation system and stop words are removed, yielding the corresponding text vector space model (VSM);

2) Based on the processed data, the user-microblog text vector space mapping is established. The user-repost, user-follower, and user-following mappings can be obtained at the same time. Although this algorithm does not compute with or process these mappings, they are of great value for later secondary mining of the data.
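The two items above can be sketched as a small pipeline that turns (user, text) pairs into one term-frequency vector per user. This is only an illustration under stated assumptions: the whitespace `tokenize` function and the tiny `STOP_WORDS` set are placeholders for a real segmenter such as ICTCLAS and a real stop-word list, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "of"}  # illustrative stop-word list


def tokenize(text):
    """Placeholder segmenter: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_user_vsm(posts):
    """Map each user to a term-frequency vector over all of their posts.

    posts is an iterable of (user_id, text) pairs.
    """
    user_terms = defaultdict(Counter)
    for user_id, text in posts:
        user_terms[user_id].update(tokenize(text))
    return dict(user_terms)


posts = [
    ("u1", "the kernel of a support vector machine"),
    ("u1", "gradient descent is simple"),
    ("u2", "the match is tonight"),
]
vsm = build_user_vsm(posts)
```

Storing the vectors sparsely as `Counter` objects keeps memory proportional to the words a user actually wrote rather than to the full vocabulary.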

Extracting features from the text data according to Step 3:

Using the text vector space model obtained in Step 2, and as shown in the feature extraction part of Fig. 4, the KL algorithm is used to calculate the cross entropy:

$$CE(W) = -\sum_{i=1}^{m} P(C_i \mid W) \log \frac{P(C_i \mid W)}{P(C_i)}$$

Here m is the number of categories, which is defined by the user; C_i is the i-th category; W is the word to be measured; and CE(W) is the cross entropy. The KL distance reflects the total influence of a word W on all categories C_i: the larger the KL distance, the more the word affects the category partition, so the word can be treated as a key word and included in the feature word vector.
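The scoring of words by KL distance can be sketched as below. This is a hedged illustration rather than the implementation of the invention: probabilities are estimated from document counts, which is an assumption, and the function names are hypothetical. The sketch computes the sum of P(C_i|W) log(P(C_i|W) / P(C_i)) without the leading minus sign of the formula above, so that a larger score means a more discriminative word, in line with the statement that a larger KL distance indicates a stronger influence on the category partition.

```python
import math
from collections import Counter, defaultdict


def kl_scores(docs):
    """Score each word W by sum_i P(C_i|W) * log(P(C_i|W) / P(C_i)).

    docs is a list of (category, tokens) pairs. Probabilities are estimated
    from document counts; categories with P(C_i|W) = 0 contribute nothing.
    """
    n_docs = len(docs)
    class_count = Counter(cat for cat, _ in docs)
    word_class_count = defaultdict(Counter)
    for cat, tokens in docs:
        for w in set(tokens):          # count each word once per document
            word_class_count[w][cat] += 1

    scores = {}
    for w, per_class in word_class_count.items():
        n_w = sum(per_class.values())  # documents that contain W
        score = 0.0
        for cat, n_wc in per_class.items():
            p_c_given_w = n_wc / n_w
            p_c = class_count[cat] / n_docs
            score += p_c_given_w * math.log(p_c_given_w / p_c)
        scores[w] = score
    return scores


def top_features(docs, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = kl_scores(docs)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


docs = [
    ("ml", ["neural", "network", "today"]),
    ("ml", ["gradient", "network"]),
    ("sports", ["match", "today"]),
    ("sports", ["goal", "match"]),
]
```

In this toy data, a word that appears equally in both categories, such as "today", scores zero, while a word confined to one category scores log 2.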

Calculating user similarity according to Step 4:

From the feature words extracted for each user, a vocabulary vector V = {v_i = 1 | v_i is a feature word} is built. For the feature vectors V_j and V_k of two users' text data, cosine similarity is used to calculate the similarity of the two users, as shown in the similarity calculation part of Fig. 4:

$$\mathrm{sim}(V_j, V_k) = \frac{\sum_{i=1}^{c} v_{ji} v_{ki}}{\sqrt{\sum_{i=1}^{c} v_{ji}^{2}} \sqrt{\sum_{i=1}^{c} v_{ki}^{2}}}$$

Here c is the length of the vocabulary (that is, the size of the text vector space model vector), v_ji is the i-th word of the j-th user, and v_ki is the i-th word of the k-th user. v_ji = 1 means that the i-th word of the j-th user is a feature word, and v_ji = 0 means that it is not. The larger the cosine value between two users, the more closely their features are related, and the two users can be regarded as belonging to the same category.
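The cosine similarity between two binary feature vectors described above can be sketched as follows. The helper names and the four-word vocabulary are illustrative assumptions; returning 0.0 for a user with no feature words is a choice made for this sketch, since the text does not cover that case.

```python
import math


def binary_vector(features, vocabulary):
    """Encode a set of feature words as a 0/1 vector over the vocabulary."""
    return [1 if w in features else 0 for w in vocabulary]


def cosine_similarity(v_j, v_k):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v_j, v_k))
    norm_j = math.sqrt(sum(a * a for a in v_j))
    norm_k = math.sqrt(sum(b * b for b in v_k))
    if norm_j == 0 or norm_k == 0:
        return 0.0  # assumption: a user with no feature words matches nobody
    return dot / (norm_j * norm_k)


vocabulary = ["gradient", "kernel", "match", "network"]
v_1 = binary_vector({"gradient", "network"}, vocabulary)
v_2 = binary_vector({"gradient", "kernel", "network"}, vocabulary)
v_3 = binary_vector({"match"}, vocabulary)
```

Here `v_1` and `v_2` share two feature words and score 2/sqrt(6), while `v_1` and `v_3` share none and score 0.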

Performing group self-detection by category according to Step 5:

Using the similarity calculated in Step 4, the users are partitioned into groups with a classification or clustering algorithm, which gives a first result for group discovery in the social network. As shown in the self-detection part of Fig. 4, once the groups have been partitioned, each group undergoes self-detection: the most representative features of the group (category) are extracted and used as the features of that group. The purpose of this step is to prepare for the detection and expansion of user-defined specific groups. Assuming a group has N features in total, majority voting is used, and the K most commonly shared features are adopted as the representative features of the group.
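The majority-voting selection of K representative features can be sketched as below. The function name and sample data are hypothetical, and breaking ties alphabetically is an arbitrary choice made for this sketch because the text does not specify a tie rule.

```python
from collections import Counter


def representative_features(member_features, k):
    """Return the k feature words shared by the most group members.

    member_features maps each user in the group to that user's set of
    feature words. Ties are broken alphabetically (an assumption).
    """
    votes = Counter()
    for features in member_features.values():
        votes.update(set(features))  # each user votes once per word
    ranked = sorted(votes, key=lambda w: (-votes[w], w))
    return ranked[:k]


group = {
    "u1": {"gradient", "network", "kernel"},
    "u2": {"gradient", "network"},
    "u3": {"gradient", "coffee"},
}
top_two = representative_features(group, 2)
```

With this data, "gradient" receives three votes and "network" two, so they become the representative features for K = 2.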

Extracting the attributes of the specific group, judging its category, and expanding the group according to Step 6:

As shown in Fig. 5, when a user needs to determine which group a given user belongs to and to expand the group with related or similar users, attributes are extracted from the samples the user provides. This mainly involves obtaining the microblog text data of the specific user and applying natural language processing to it, including word segmentation and TF-IDF word weighting. KL distance is then used for feature extraction as before, and similarity is calculated against each category, which determines the category the user belongs to and allows the group to be expanded with similar samples. In this process, if a user's similarities to several groups are very close, differing by no more than a user-set threshold θ, the problem of overlapping communities arises: one user may belong to several groups. The group partition results then need to be verified, and the result sets merged, to complete the group expansion.
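The handling of overlapping groups with the threshold θ can be sketched as follows. This is one possible reading of the text, stated as an assumption: a user is assigned to every group whose similarity lies within θ of the user's best-matching group. The function name and the similarity values are illustrative.

```python
def assign_groups(similarities, theta):
    """Return every group whose similarity is within theta of the best one.

    similarities maps a group name to the similarity between the user and
    that group's representative features. More than one returned group
    means the user falls into overlapping groups.
    """
    if not similarities:
        return []
    best = max(similarities.values())
    return sorted(g for g, s in similarities.items() if best - s <= theta)


user_sims = {"ml": 0.82, "nlp": 0.80, "sports": 0.10}
groups = assign_groups(user_sims, theta=0.05)
```

With θ = 0.05 the user above is placed in both "ml" and "nlp"; with θ = 0 only the single best group is returned. A minimum-similarity cutoff could be added so that users unrelated to every group are rejected, but the text does not specify one.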

Claims

1. A specific group discovery and expansion method based on microblog data, characterized by comprising the following steps:

Step 1, collect relevant group information: using crawler technology or data resources made public by the microblog platform, obtain the group information to be analyzed, the information comprising: the text of microblog posts published by users; the text of comments made by users; the interactions users carry out on the microblog platform, including comment operations, repost relationships, and like operations; and users' basic attributes, including follower count, following count, and following relationships;

Step 3, extract features from the text data: for the microblog content published by each user, use relative entropy for feature extraction, obtain each user's feature words, and establish the corresponding mapping;

Step 6, extract the attributes of the specific group, judge its category, and expand the group: the sample data of the specific group to be discovered is used as training data; feature words are obtained from this training data with the text feature extraction algorithm of Step 3, similarity to the category features is calculated according to Step 4, and a list of users related to the cluster is obtained.