CN111444402A

CN111444402A - Analysis method for community detection based on index construction and social factor control network

Info

Publication number: CN111444402A
Application number: CN201911036341.3A
Authority: CN
Inventors: 朱海; 李雪威; 王文俊; 武南南
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2020-07-24

Abstract

The invention discloses an analysis method for community detection based on index construction and a social factor control network. The method mainly comprises two steps: first, a generalized causal relationship network is constructed according to social factor control theory; then an index is constructed according to FTV framework theory and queried to mine the community structure of the network. The causal relationship network built from social factor control theory is used in the implementation steps: queries based on FTV framework theory are run against it to mine the community structure, and a dictionary structure is constructed in the network.

Description

Analysis method of community detection based on index construction and social factor control network

Technical field

The invention belongs to the field of network analysis. It is a method that performs queries based on index construction and performs analysis based on social factor control theory. A factor-control relationship network is first constructed from social factor-control relationships; indexing technology is then used to improve query speed and accuracy, and the community relationships in the network are analyzed on that basis.

Background art

In recent years, with the spread and development of social networks, a growing number of users have generated large amounts of data, and analyzing the possible community structures in this massive data has become a challenge in the field of network analysis. With the emergence of technologies such as Hadoop, the problems posed by massive data have gradually shifted from data storage to network construction and network analysis. Identifying possible community structures in massive data is valuable in many fields. For example, analyzing the potential communities in a social network can expose fraud rings, which is of great significance for network security. This section introduces the state of research on community detection in network analysis.

Much research has been done on community detection for massive data. Massive graph data usually takes one of two forms. The first is a single very large graph built from massive data, such as a social network, the World Wide Web, or an e-commerce transaction network. In a social network, for example, each node represents a person and each edge represents a relationship between two people. Querying this type of graph was first identified as an NP-complete (NPC) problem by R. M. Karp. The purpose of community detection is to find closely connected people and relationships in the network, from which possible events can be analyzed, such as uncovering dangerous fraud rings. Karp proposed using the maximum complete subgraph (maximum clique) approach to model querying and community detection on massive data graphs. The second form is a graph network made up of a very large number of small graphs, such as a compound network. In a network of compounds, each atom is a node and each edge represents the force between atoms. Such problems can be queried by approximate subgraph matching, which is also an NPC problem; in 1976 J. Ullmann first proposed a way to solve it using backtracking. The present invention addresses the first type of graph query.

Under Karp's approach, the massive data graph query problem is turned from an NPC problem into a solvable one, but queries are too slow, and with today's rapid growth in data the approach is even harder to apply. V. Bonnici later proposed a solution framework, the filter-then-verify (FTV) framework, which greatly speeds up queries and improves query accuracy.

This method first models the network according to the relationships described by social factor control theory and constructs a causal relationship network. It then queries and analyzes the constructed network according to FTV framework theory to achieve better matching, and rebuilds the community detection method on index query technology. The method is significant both for its practical experimental results and for follow-up research on extensibility.

Summary of the invention

This method mines the community structure in massive data graphs and uses FTV framework theory to speed up queries and improve query accuracy, so that community structures can be mined more quickly from large-scale static graph data. It has considerable application value in scenarios such as fraud ring detection, recommendation of groups with shared interests, and early warning of event outbreaks.

The scheme is divided into the following two steps: first, causal relationships are extracted according to dependency syntax; then the extracted causal relationships are used to construct a generalized causal relationship network.

The method is divided into the following two steps: first, a generalized causal relationship network is constructed according to social factor control theory; then an index is constructed according to FTV framework theory and queried to mine the community structure of the network.

The causal relationship network is constructed according to social factor control theory, with the following implementation steps:

Step one: construct the network. The pyspider framework is used to crawl post content and friend relationship lists from Weibo as the empirical data for this method.

The post content is then processed. The Fudan University word segmenter splits each user's posts into words, and irrelevant modal particles are removed; FNLP keyword extraction is applied to the input post data to extract keywords. Parts of speech are then tagged, followed by semantic parsing and query abstraction (see Figure 1).

An elementary SNA network is first constructed from the extracted data relationships. Semantic content is then extracted from the text according to the semantics obtained in b, and possible additional nodes and edges are mined, which makes the constructed network denser and reduces its sparsity. The following example illustrates semantic extraction under the semantic analysis model. Take a post that reads "I hate xxx", where "xxx" is a person's name. "I" stands for the ID of the post's author, "hate" is a relational verb, and "xxx" is another object, so a hidden edge with the post's author as one endpoint can be extracted.
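The "I hate xxx" example can be sketched in Python as follows. This is only an illustration under stated assumptions: the triples are assumed to have been produced already by the segmentation and parsing stage, the function name `build_network`, the user IDs, and the `"friend"` label are hypothetical, and treating the hidden edge as directed from the author to the named person is a choice made here, not something the text specifies.

```python
def build_network(friend_pairs, semantic_triples):
    """Combine explicit friend relations with edges mined from post text.

    friend_pairs:     iterable of (user_a, user_b) from the crawled friend lists
    semantic_triples: iterable of (author_id, relation_verb, target_name),
                      assumed to come from the semantic parsing stage
    Returns (nodes, edges), where each edge is (source, target, label).
    """
    nodes, edges = set(), set()

    # Elementary network: one edge per explicit friend relation.
    for a, b in friend_pairs:
        nodes.update((a, b))
        edges.add((a, b, "friend"))

    # Hidden edges: "I hate xxx" becomes (author, "xxx", "hate").
    for author, verb, target in semantic_triples:
        nodes.update((author, target))  # the target may be a new node
        edges.add((author, target, verb))

    return nodes, edges


friends = [("u1", "u2")]
triples = [("u1", "hate", "u3")]  # post by u1: "I hate u3"
nodes, edges = build_network(friends, triples)
```

In this toy input the friend list alone leaves `u3` isolated from the network; the mined triple adds both the node and an edge to it, which is the sparsity reduction the paragraph describes.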

Using the constructed network, queries are then performed based on FTV framework theory to mine the community structure in the network (see Figure 2):

Step two: construct the dictionary structure in the network. This dictionary structure is the basis for the query index built later.

The dictionary structure is built from paths in the graph: the set of paths of length at most p serves as an index, and this index is called a "fingerprint".
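A minimal Python sketch of one way such a path-based fingerprint could be computed is given below. The text does not say how paths are mapped to bits, so several details are assumptions: nodes carry labels, path length is counted in edges, each label path is hashed with CRC32 into one position of a fixed-size bitmap, and the names `path_fingerprint` and `n_bits` are hypothetical.

```python
import zlib


def path_fingerprint(adj, labels, p, n_bits=256):
    """Return an integer bitmap encoding all simple label paths with <= p edges.

    adj:    dict mapping node -> iterable of neighbor nodes
    labels: dict mapping node -> label string
    p:      maximum path length, counted in edges
    """
    fingerprint = 0

    def walk(node, path_labels, visited):
        nonlocal fingerprint
        # Hash the label sequence of the current path to one bit position.
        key = "/".join(path_labels).encode("utf-8")
        fingerprint |= 1 << (zlib.crc32(key) % n_bits)
        if len(path_labels) - 1 == p:  # the path already has p edges
            return
        for nxt in adj.get(node, ()):
            if nxt not in visited:  # keep paths simple (no repeated nodes)
                walk(nxt, path_labels + [labels[nxt]], visited | {nxt})

    for start in labels:
        walk(start, [labels[start]], {start})
    return fingerprint


data_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
data_labels = {"a": "user", "b": "user", "c": "topic"}
query_adj = {"a": ["b"], "b": ["a"]}
query_labels = {"a": "user", "b": "user"}

fd = path_fingerprint(data_adj, data_labels, p=2)
fq = path_fingerprint(query_adj, query_labels, p=2)
```

Because every path of a query subgraph is also a path of the graph that contains it, every bit set in `fq` is also set in `fd`. Hash collisions can produce false candidates but never false dismissals, which is why a verification stage follows the filter.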

Bitmap inclusion is performed by first splitting the fingerprints (query FQ and database FD) into sequences of long integers (query LQ and database LD), and then testing bitmap inclusion for each pair (LQ, LD) with the bitwise operation (LQ ∧ LD = LQ).
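The inclusion test can be sketched in Python as follows. The 64-bit word size and the helper names `to_words` and `bitmap_included` are assumptions for illustration; the sketch also assumes both fingerprints have the same total bit length.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(fingerprint, n_bits):
    """Split an integer bitmap of n_bits into 64-bit words, lowest word first."""
    n_words = (n_bits + WORD_BITS - 1) // WORD_BITS
    return [(fingerprint >> (i * WORD_BITS)) & WORD_MASK for i in range(n_words)]


def bitmap_included(lq_words, ld_words):
    """True if every bit set in the query is also set in the database entry.

    Applies LQ & LD == LQ to each pair of words; all() stops at the first
    word that fails, so most non-matching entries are rejected early.
    """
    return all((lq & ld) == lq for lq, ld in zip(lq_words, ld_words))


fq = 0b1010 | (1 << 70)
fd = 0b1110 | (1 << 70) | (1 << 100)
lq = to_words(fq, 128)
ld = to_words(fd, 128)
is_candidate = bitmap_included(lq, ld)
```

Here `is_candidate` is true because `fd` contains every bit of `fq`; swapping the arguments fails the test, since `fd` has bits that `fq` lacks.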

The structure of the nodes in the dictionary library is modified by adding a new field that records edge labels. Each fingerprint is constructed in exactly the same way as before, except that each bit of the resulting fingerprint is now appended to the end of the corresponding bitmap (see Figure 4).

A parent node is constructed by performing a binary OR operation on its two child nodes. If the comparison between the query fingerprint and a given node of the tree returns false, the entire branch below that node can be discarded. Therefore, if the database is large enough, searching the dictionary library requires fewer comparisons than there are fingerprints in the database.
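The OR-tree and its pruning rule can be sketched in Python as follows. The text does not specify how leaves are grouped, so pairing adjacent entries bottom-up is an assumption, as are the names `Node`, `build_tree`, and `search` and the entry IDs in the example.

```python
class Node:
    """Tree node holding a bitmap; leaves also carry an entry id."""

    def __init__(self, bitmap, left=None, right=None, entry_id=None):
        self.bitmap = bitmap
        self.left = left
        self.right = right
        self.entry_id = entry_id


def build_tree(fingerprints):
    """Build a binary tree bottom-up from a list of (entry_id, bitmap) pairs."""
    level = [Node(bitmap, entry_id=eid) for eid, bitmap in fingerprints]
    if not level:
        return None
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                left, right = level[i], level[i + 1]
                # The parent bitmap is the binary OR of its two children.
                parents.append(Node(left.bitmap | right.bitmap, left, right))
            else:
                parents.append(level[i])  # odd node is carried up unchanged
        level = parents
    return level[0]


def search(node, fq):
    """Return ids of entries whose bitmap includes the query fingerprint fq."""
    if node is None or (fq & node.bitmap) != fq:
        return []  # inclusion fails here, so the whole branch is discarded
    if node.entry_id is not None:
        return [node.entry_id]
    return search(node.left, fq) + search(node.right, fq)


root = build_tree([("g1", 0b0011), ("g2", 0b0101), ("g3", 0b1100), ("g4", 0b1000)])
candidates = search(root, 0b1000)
```

For the query `0b1000`, the parent of `g1` and `g2` has bitmap `0b0111`, so that branch is rejected with a single comparison. The saving grows with the number of leaves under each pruned node, and it is larger when similar bitmaps are grouped under the same parent.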

To apply the "fingerprint" to a set of bitmaps, a distance measure and a method for computing the mean must be defined. The number of 1s shared by elements of the same cluster must be maximized so as to minimize the number of 1s in their binary OR. If b1 and b2 are two bitmaps, the distance between them, denoted d(b1, b2), is defined in the current proposal as follows:

With the method in step f, the query dictionary library can be constructed. The community structure of massive data can then be queried directly from the dictionary library, which speeds up queries and improves query accuracy.

Beneficial effects

Existing methods for mining massive data graphs mainly use social factor control theory to build a causal relationship network, but they rely only on social-relationship theory and do not use full causal syntactic relations, although the accuracy and speed of their query results are acceptable. The present method offers the following main gains:

First, when the social network is constructed, semantic analysis is used in addition to the grouping information in the data, which reduces problems related to network sparsity.

Second, the FTV framework is used and improved with bitmap compression, which effectively reduces the size of the constructed index and speeds up queries.

Finally, the method is broadly compatible: it can be applied not only to graph query but also to community detection in, for example, citation networks and communication networks.

Brief description of the drawings

Figure 1 is the query abstraction diagram;

Figure 2 is the index verification query structure;

Figure 3 is the bitmap dictionary library calculation;

Figure 4 shows how the relationship network is built from the data;

Figure 5 is an example of using text semantic relations to reduce network sparsity;

Figure 6 shows the theoretical filtering framework and index construction for queries after the network is established.

Detailed description of embodiments

The present invention is further described below with reference to the accompanying drawings.

The method proposed here for mining the causes of emotions based on dependency syntax and a generalized causal network is mainly used to discover causal relationships in text and identify the patterns by which they operate. Emotion causes can be mined by following the steps described below.

The analysis method proposed here for community detection based on index construction and a social factor control network is mainly used to detect communities with a given structure quickly and accurately in massive data, so that the corresponding relationships can be predicted. The network is built and the index query is constructed according to the steps described below.

Step 1: Obtain the post data and user data of Weibo users, mainly by crawling with pyspider.

Step 2: Clean the crawled data, then extract and organize the users' account data and post data. This step removes invalid information such as empty accounts and empty posts.

Step 3: Organize the post data, perform semantic analysis on it, and extract the keywords in each post.

Step 4: Analyze the semantics of the keywords and remove modal particles and symbols.

Step 5: After invalid keywords and symbols have been filtered out, use the word segmenter to identify account names, related account names, and the relationships between the names.

Step 6: Construct the factor-control relationship network from the extracted valid keywords and the relationship table.

Step 7: Build a relationship network with the user relationship network as its main body, supplemented by the semantic analysis of posts.

Step 8: Build the dictionary library based on FTV filtering framework theory. First extract, for the user nodes, the fingerprints formed by paths no longer than p (the index path length) and match them; the matching results form the dictionary library.

Step 9: Compress the dictionary library with a bitmap compression algorithm to make querying easier.

Step 10: Query the community structure in the network according to the dictionary library.

Steps 1 to 5 implement step one of the technical solution; after step one, a complex network carried by associated semantics can be constructed. Data can be crawled and processed according to the framework in Figure 3, and the associated semantic network can be constructed through semantic analysis as shown in Figure 4. Steps 6 to 9 implement step two of the technical solution; through this step, the index dictionary library for abstracted queries can be constructed. The data structure of the dictionary library is shown in Figure 2. Finally, in step 10, a candidate set, that is, a set of communities with a specific structure, can be obtained from the index for a given query. The pattern of the obtained candidate set is shown in Figure 6.

Claims

1. An analysis method for community detection based on index construction and a social factor control network, characterized by mainly comprising the following two steps: first, constructing a generalized causal relationship network according to social factor control theory; then constructing an index according to FTV framework theory and querying it to mine the community structure of the network;

a causal relationship network is constructed by the social factor control theory, and the implementation steps are as follows:

step one, constructing a network;

using the constructed network, then carrying out query work based on FTV framework theory, and mining community structures in the network;

step two, constructing a dictionary structure in the network;

bitmap inclusion is performed by first dividing the fingerprints (query FQ and database FD) into long integer sequences (query LQ and database LD), and then testing the bitmap inclusion between each pair (LQ, LD) using the bitmap operation (LQ ∧ LD = LQ).

2. The analysis method for community detection based on index construction and a social factor control network according to claim 1, wherein the structure of the nodes in the dictionary library is modified by adding a new field recording edge labels, and each fingerprint is constructed in exactly the same way, except that each bit of the fingerprint obtained is now appended to the end of the corresponding bitmap.

3. The analysis method for community detection based on index construction and a social factor control network according to claim 1, wherein a parent node is constructed by performing a binary OR operation between two child nodes.

4. The analysis method for community detection based on index construction and a social factor control network according to claim 1, wherein, to apply the "fingerprint" to a set of bitmaps, a distance measure and a method for computing the mean are defined.