CN115185715A

CN115185715A - Case popularity diffusion processing method based on social network information

Info

Publication number: CN115185715A
Application number: CN202211107987.8A
Authority: CN
Inventors: 董卓达; 陈岩
Original assignee: Shenzhen Huayun Zhongsheng Technology Co ltd
Current assignee: Shenzhen Huayun Zhongsheng Technology Co ltd
Priority date: 2022-09-13
Filing date: 2022-09-13
Publication date: 2022-10-14

Abstract

The invention relates to the field of information processing, and in particular to a case popularity diffusion processing method based on social network case information. The method crawls social network case information data, collects the forwarding and comment data generated during diffusion, completes the diffusion processing of the social network case information data on that basis, and thereby improves the monitoring of how hot case information spreads on the Internet.

Description

Case popularity diffusion processing method based on social network information

Technical Field

The invention relates to the field of information processing, in particular to a case popularity diffusion processing method based on social network information.

Background

As the volume of information on social networks grows, the diffusion of case information through those networks becomes increasingly evident. In public interest litigation, the case clues and the influence that the related cases have on the public become an important reference for handling such cases.

How to process the diffusion of social network case information data has therefore become a research hotspot in Internet law. Diffusion indicators can serve as a guide to the influence of public interest litigation cases and can support case handling, for example in subsequent case retrieval.

Digital case resources have generally reached a considerable scale, so research on the diffusion of social network case information data is of great importance. In practice, one or more pieces of information are sent from multiple sources: in multi-source diffusion, a message is often published by several users at the same time and then spreads through the network. One existing method abstracts information into events according to the characteristics of information diffusion in social networks, constructs a large-scale event network over massive information, and introduces distributed processing to meet the detection requirements of massive information; however, it does not consider the association between user nodes, which is an important factor. Another existing method studies the cooperation and competition among nodes from the perspective of the message propagation mechanism and considers the influence of time on message diffusion, but its calculation process is complex.

Disclosure of Invention

To solve at least one of the above problems, the invention provides a case popularity diffusion processing method based on social network information, which uses an association rule method to process the diffusion of social network case information data effectively.

The method comprises constructing a diffusion monitoring model in which the diffusion of information is modeled through a directed network $G = (U, E)$, where $U$ is the set of all nodes and $E \subset U \times U$ is the set of all arcs. For each arc $(u_x, u_y) \in E$ there are two parameters: $f_{u_x,u_y}(t)$, which gives the probability that $u_x$ at time $t$ transmits information to $u_y$, where $0 < f_{u_x,u_y}(t) < 1$; and $r_{u_x,u_y}$, where $r_{u_x,u_y} > 0$. $f_{u_x,u_y}$ is referred to as the diffusion function and $r_{u_x,u_y}$ as the time delay parameter; $f_{u_x,u_y}$ is a function of the node, edge and exchanged content characteristics.

The model computes the probability that node $u_x$ at time $t$ sends a message $i$ to node $u_y$. The 13 interpretable features described below are values between 0 and 1 calculated from past information diffusion traces.

The probability is a function of node, edge and topic features belonging to the social, topic and time dimensions:

Social dimension features: the rate at which each node issues messages (one value for $u_x$ and one for $u_y$); the Jaccard similarity coefficient $H(u_x, u_y)$ of the two sets of nodes with which $u_x$ and $u_y$ interact; the ratio of directed to undirected messages issued by each node (one value for $u_x$ and one for $u_y$); the rate at which each node receives targeted messages, $mR(u_x)$, $mR(u_y)$;

Topic dimension features: the interest of each user in the information (one value for $u_x$ and one for $u_y$);

Time dimension features: the distribution of each user's activity during a day, a non-parametric function stored as a vector (one for $u_x$ and one for $u_y$).

The diffusion probability is computed from the vector $V$ of feature values; its coefficients are obtained by Bayesian logistic regression on data describing how past information propagated in the network.
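
For illustration, a logistic regression model over a 13-dimensional feature vector $V$ typically takes the form below. The probability symbol $p$, its arguments (sending node $u_x$, receiving node $u_y$, message $i$, time $t$) and the coefficient symbols $w_0, w_1, \ldots, w_{13}$ are assumptions introduced for this sketch; it should be read as a standard logistic form consistent with the description, not as the exact equation of the method.

```latex
p(u_x, u_y, i, t) =
  \frac{\exp\left(w_0 + \sum_{j=1}^{13} w_j V_j\right)}
       {1 + \exp\left(w_0 + \sum_{j=1}^{13} w_j V_j\right)}
```

Because every feature $V_j$ lies between 0 and 1, the sign and magnitude of each $w_j$ can be read directly as the direction and strength of that feature's effect on the diffusion probability.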

The method further comprises performing feature detection on the diffusion graph of a diffusion event. The inputs of the feature detection are the diffusion graph and the feature coefficient of the diffusion event; the outputs of the algorithm are the diffusion feature of the event and its APL value.

The feature detection specifically includes:

1) Setting the feature coefficient;

2) Counting the degree of each node in the diffusion graph according to the adjacency list structure of the diffusion graph;

3) Counting the numbers of multi-branch nodes and two-branch nodes in the graph, wherein a node whose degree is greater than 2 is a multi-branch node and any other node is a two-branch node;

4) Calculating the ratio of star nodes, and classifying the diffusion feature of the event by comparing the ratio with the feature coefficient;

5) Calculating the APL value of each connected branch of the diffusion graph;

6) Calculating the APL value of the whole diffusion graph, i.e. the value of the event's ability to diffuse.

Further preferably, the feature detection algorithm is performed in a distributed detection mode.

Further, in the feature detection, the diffusion graphs of each event type are assigned to one partition, so that a plurality of reduce tasks are executed in parallel.

Further, constructing the diffusion model also comprises dividing the large social network graph into subgraphs and then distributing each subgraph to a process node. Each subgraph has two types of nodes: internal nodes and edge nodes. An internal node is a node all of whose neighbors are in the subgraph; an edge node has neighbors in other subgraphs. For each subgraph $G$, all internal nodes and the edges between them constitute a closed graph $\bar{G}$; the edge nodes may be regarded as "supporting information" for the update rules.

Further, the features of the case include: forwarding amount, comment amount, user node degree and activity degree.

Further, the diffusion of case information is detected using association rules over the feature information of the social network case information data.

Further, the diffusion processes of diffusion events with the same feature are compared using the average path length (APL) from graph theory, and the nodes and edges of the event diffusion graph are stored in an adjacency list.

Preferably, a case popularity diffusion processing system based on social network information is further provided, and the system includes a processor and a memory, the memory stores a computer program thereon, and the processor is used for executing the computer program on the memory to implement the method.

The method disclosed by the invention performs diffusion processing on social network case information data through association rules. The method crawls social network case information data, collects the forwarding and comment data generated during diffusion, completes the diffusion processing of the social network case information data on that basis, and thereby improves the monitoring of how hot case information spreads on the Internet.

Drawings

The features and advantages of the present disclosure will be more clearly understood by reference to the accompanying drawings, which are schematic and are not to be construed as limiting the disclosure in any way.

FIG. 1 is a schematic view of the event diffusion graph topology of the present method.

FIG. 2 is a schematic diagram of a data abstraction model and data structure of the method.

FIG. 3 is a schematic diagram of case information forwarding and review in the method.

Fig. 4 is a schematic input and output diagram of the detection method of the present method.

Detailed Description

These and other features and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will be better understood by reference to the following description and drawings, which form a part of this specification. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the disclosure. It will be understood that the figures are not drawn to scale. Various block diagrams are used in this disclosure to illustrate various variations of embodiments according to the disclosure.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that "/" in this context means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and covers three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.

It should be noted that, for convenience of clearly describing the technical solutions of the embodiments of the present application, in the embodiments of the present application, words such as "first" and "second" are used to distinguish the same items or similar items with substantially the same function or action, and those skilled in the art may understand that words such as "first" and "second" do not limit the quantity and execution order. For example, the first information and the second information are for distinguishing different information, not for describing a specific order of information.

It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present invention is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "for example" is intended to present related concepts in a concrete fashion.

Example 1

Let $I = \{i_1, i_2, \ldots, i_m\}$ denote a set of keywords in Internet legal texts, whose elements are called items. Let $D$ denote the set of related data, called the database, in which each transaction $T$ is a set of items, i.e. $T$ is a subset of $I$, $T \subseteq I$. Each transaction has a unique identifier, the transaction number, denoted TID.

An association rule has the form $X \Rightarrow Y$, where $X$ and $Y$ are predicates or data items; the rule means that, within the same transaction, if $X$ occurs then $Y$ also occurs. Let $A$ be an itemset and let transaction $T$ contain $A$, i.e. $A \subseteq T$; an association rule is then $A \Rightarrow B$, where $A \subset I$, $B \subset I$ and $A \cap B = \emptyset$.

The support of the rule $A \Rightarrow B$ in the transaction data set $D$ is defined as the ratio of the number of transactions containing both $A$ and $B$ to the total number of transactions, and is denoted $\mathrm{support}(A \Rightarrow B)$:

$$\mathrm{support}(A \Rightarrow B) = P(A \cup B) = \frac{|\{T \in D : A \cup B \subseteq T\}|}{|D|} \qquad (1)$$

The confidence of the rule $A \Rightarrow B$ in the transaction set is the ratio of the number of transactions containing both $A$ and $B$ to the number of transactions containing $A$, and is denoted $\mathrm{confidence}(A \Rightarrow B)$:

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{|\{T \in D : A \cup B \subseteq T\}|}{|\{T \in D : A \subseteq T\}|}$$

where $|\{T \in D : A \cup B \subseteq T\}|$ is the number of transaction records containing the itemset $A \cup B$, and $|\{T \in D : A \subseteq T\}|$ is the number of transaction records containing the itemset $A$.

The support and the confidence are the measures of a rule, reflecting its usefulness and its certainty respectively; their thresholds range from 0% to 100%.
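
The two measures can be illustrated with a minimal Python sketch. The keyword transactions below are invented for the example, and the function names `support` and `confidence` are chosen here for readability; neither comes from the source.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_ab = sum(1 for t in transactions if ab <= t)
    return count_ab / count_a if count_a else 0.0


# Hypothetical keyword transactions (illustrative data only).
D = [frozenset(t) for t in (
    {"pollution", "lawsuit", "forward"},
    {"pollution", "lawsuit"},
    {"pollution", "comment"},
    {"lawsuit", "comment"},
)]

print(support(D, {"pollution", "lawsuit"}))       # 2 of 4 transactions
print(confidence(D, {"pollution"}, {"lawsuit"}))  # 2 of the 3 with "pollution"
```

Representing each transaction as a `frozenset` makes the containment test a single subset comparison.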

Given a transaction set $D$ of Internet legal topics, mining association rules means finding the rules that satisfy a minimum support min-sup and a minimum confidence min-con specified by the user.

In practical applications, people usually pay attention only to association rules that reach a certain support and confidence; a rule that satisfies both min-sup and min-con is called a strong rule. The association rule mining problem is therefore the problem of generating the strong rules for a given transaction set $D$: for a transaction database $D$, find all association rules $A \Rightarrow B$ that satisfy $\mathrm{support}(A \Rightarrow B) \geq$ min-sup and $\mathrm{confidence}(A \Rightarrow B) \geq$ min-con [9].

The detailed process of forming the association rules is as follows:

(1) Traverse each frequent itemset $l$ to obtain all non-empty subsets $s$ of $l$;

(2) If $\mathrm{support}(l) / \mathrm{support}(s) \geq$ min-con, then the association rule "$s \Rightarrow (l - s)$" is generated.

A collection of items is called an itemset, and an itemset containing $k$ data items is called a $k$-itemset. The frequency of an itemset is the number of transactions in $D$ that contain it. If the frequency of an itemset is at least the product of min-sup and the total number of transactions in $D$, the itemset satisfies the minimum support min-sup and is called a frequent itemset. The set of frequent $k$-itemsets is commonly written as $L_k$.
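
A minimal sketch of the two rule-forming steps follows. It enumerates frequent itemsets by brute force, which is only practical for small keyword sets and is not claimed to be the enumeration strategy of the method; the sample transactions and the names `frequent_itemsets` and `strong_rules` are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Map every itemset whose support is >= min_sup to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            c = frozenset(cand)
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= min_sup:
                result[c] = sup
                found = True
        if not found:
            break  # no frequent k-itemset, so no larger one can be frequent
    return result


def strong_rules(freq, min_con):
    """Step (1): take each frequent itemset l and its non-empty proper subsets s.
    Step (2): keep s => (l - s) when support(l) / support(s) >= min_con."""
    rules = []
    for l, sup_l in freq.items():
        for k in range(1, len(l)):
            for s in combinations(sorted(l), k):
                s = frozenset(s)
                conf = sup_l / freq[s]  # every subset of a frequent set is frequent
                if conf >= min_con:
                    rules.append((s, l - s, conf))
    return rules


# Hypothetical keyword transactions (illustrative data only).
D = [frozenset(t) for t in (
    {"pollution", "lawsuit", "forward"},
    {"pollution", "lawsuit"},
    {"pollution", "comment"},
    {"lawsuit", "comment"},
)]

freq = frequent_itemsets(D, min_sup=0.5)
rules = strong_rules(freq, min_con=0.6)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 3))
```

Only proper subsets are used as antecedents, so that the consequent $l - s$ is never empty.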

The diffusion principle of case information data is set as follows:

Suppose $W = \{w_1, w_2, \ldots, w_n\}$ denotes the case information data samples, $V$ is the fundamental domain, and $v_j$ is the observed value of $w_j$. Then there exists a diffusion function such that the information obtained at $v_j$ is diffused to every point $v$ of $V$ according to that function. The distribution of the original case information data obtained by this diffusion can better reflect the overall law of $W$.

Most law-related information in Internet media is influenced by user behavior and tends to diverge as it diffuses. This scheme performs diffusion processing on social network case information data through the association rules described above: it crawls social network case information data, collects the forwarding and comment data generated during diffusion, and completes the diffusion processing of the social network case information data on that basis.

The concept of event diffusion is both distinct from and connected to that of information diffusion. Information diffusion mainly refers to the physical propagation of information, where suppressing diffusion requires cutting off nodes; when information diffusion is studied in a social network carrying many pieces of information, the emphasis is on the diffusion of mainstream information. The event is introduced to model information diffusion: it packages and abstracts the information in the original diffusion process.

The method uses the average path length (APL) from graph theory to compare the diffusion processes of diffusion events with the same feature, and uses an adjacency list to store the nodes and edges of the event diffusion graph. The storage structure is as follows:

struct DiffusionGraph {
    boolean connected;             // result of the connectivity test
    long eventTimes;               // number of events in the diffusion graph
    ArrayList<Long> EgNodes;       // nodes of the diffusion graph; its size is the total node count
    Map<Long, Set<Long>> nbr_map;  // adjacency list: node id -> ids of its neighbors
    String Info;                   // event content information
}

In the above storage structure, connected records the result of the connectivity check of the diffusion graph; eventTimes is the total number of events; EgNodes holds the nodes of the graph, whose count is used to quantify the diffusion rate; nbr_map stores the nodes and edges of the graph; Info stores the event content of the diffusion graph. The topology of event diffusion and its corresponding adjacency list structure are shown in Fig. 1 and Fig. 2.
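
As a hedged illustration, the same storage structure can be sketched in Python. The snake_case field names, the treatment of edges as undirected, and the choice to derive `connected` and the node list from the adjacency map rather than storing them are assumptions of this sketch.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DiffusionGraph:
    info: str = ""          # event content information
    event_times: int = 0    # number of events in the diffusion graph
    nbr_map: Dict[int, Set[int]] = field(default_factory=dict)  # adjacency list

    def add_edge(self, u: int, v: int) -> None:
        """Record an (assumed undirected) edge between nodes u and v."""
        self.nbr_map.setdefault(u, set()).add(v)
        self.nbr_map.setdefault(v, set()).add(u)

    @property
    def eg_nodes(self) -> List[int]:
        """All nodes of the graph; the length is the total node count."""
        return sorted(self.nbr_map)

    @property
    def connected(self) -> bool:
        """Breadth-first search from any node must reach every node."""
        if not self.nbr_map:
            return True
        start = next(iter(self.nbr_map))
        seen = {start}
        queue = deque([start])
        while queue:
            for nb in self.nbr_map[queue.popleft()]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        return len(seen) == len(self.nbr_map)


g = DiffusionGraph(info="hypothetical case post", event_times=1)
g.add_edge(1, 2)
g.add_edge(2, 3)
print(g.connected, len(g.eg_nodes))
```

Deriving the connectivity flag on demand avoids a stale value after edges are added; a stored flag, as in the original structure, would be cheaper to read on large graphs.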

Obtaining the static characteristics of case information diffusion. To study the diffusion of social network case information data, the numbers of forwards and comments on legal case information are collected at an interval of 15 min. The hot information is a news item about a certain case published by a Sina Weibo user; its forwarding and comment activity is shown in Fig. 3. The times at which downstream users forward the information are not concentrated and, viewed overall, follow a normal distribution only to a certain extent. Meanwhile, the number of user comments differs from one time period to another and is positively correlated with the forwarding frequency per time period shown in the histogram of Fig. 3. The line chart of Fig. 3 shows the trend of the number of comments over the time intervals. Extracting the correlation between the forwarding amount and the comment amount: using the association rules obtained by statistical mining, the correlation coefficient between the comment amount and the forwarding amount is calculated to be 0.72. This shows that, in the spread of Internet legal topics, users' forwarding behavior and comment behavior are clearly correlated.
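
The text does not state which correlation coefficient is used. As one plausible reading, the sketch below computes Pearson's r over per-interval counts; the choice of Pearson's r and the sample counts are assumptions, and the counts are invented, so the sketch does not reproduce the reported value of 0.72.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical counts per 15-minute interval (illustrative data only).
forwards = [12, 30, 45, 28, 10, 5]
comments = [8, 22, 30, 25, 9, 2]
r = pearson(forwards, comments)
print(round(r, 3))
```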

On the one hand, in the case information of network media, a user can comment on a piece of case information while forwarding it and forward it while commenting on it, so the forwarding count and the comment count increase together. On the other hand, the information value of the case itself is high, so it attracts much social attention; when a user chooses to comment and forward, the content matches the user's interest. Therefore, the forwarding amount is positively correlated with the comment amount.

Obtaining the correlation between user node degree and activity. The node degree reflects how closely a node of the case message network is connected to its adjacent nodes, and comprises in-degree and out-degree: the in-degree is the number of users who follow the user (followers), and the out-degree is the number of users whom the user follows. As the in-degree increases, the user's influence grows and the information the user publishes can be browsed by more users. As the out-degree increases, the user can browse more information. The node degree is measured through the association rules: the stronger the association, the higher the node degree.

Table 1 shows the node degree, in-degree and out-degree of users, counted by the association rule method on the above samples.

TABLE 1 Node degree, in-degree and out-degree of the sample nodes

It can be seen that the node degree of node A is 677, which indicates that node A has 677 connections with its adjacent nodes. The in-degree of node A is 168 and its out-degree is 509, i.e. the user has 168 followers and follows 509 users, which indicates that the user is highly active in the social network. For node W, the in-degree is 6 and the out-degree is also 6, which indicates that node W is not active in spreading the case information; such a user is generally regarded as a lurker.
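
The relation between the three quantities (node degree = in-degree + out-degree, e.g. 168 + 509 = 677 for node A) can be sketched as follows. The edge list is invented for the example, and the convention that an edge `(follower, followee)` means "follower follows followee" is an assumption of this sketch.

```python
from collections import Counter


def degrees(follow_edges):
    """Compute in-degree (followers), out-degree (followees) and total degree
    for every user appearing in a list of (follower, followee) pairs."""
    in_deg, out_deg = Counter(), Counter()
    for follower, followee in follow_edges:
        out_deg[follower] += 1
        in_deg[followee] += 1
    nodes = set(in_deg) | set(out_deg)
    return {
        n: {"in": in_deg[n], "out": out_deg[n], "degree": in_deg[n] + out_deg[n]}
        for n in nodes
    }


# Hypothetical follow relations (illustrative data only).
edges = [("A", "B"), ("A", "C"), ("B", "A"), ("W", "A")]
stats = degrees(edges)
print(stats["A"])
```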

Extracting the relation between diffusion breadth and user influence. When the diffusion of social network case information data is studied through association rules, the influence of the information publisher also strongly affects diffusion; user influence is evaluated mainly by the number of followers. If a user has more friends, the published information can be browsed, followed and forwarded by more people, which facilitates the diffusion of the information data. The forwarding scale of the hot information for different users is summarized in Table 2.

TABLE 2 Diffusion breadth of case information data for different users

As can be seen from Table 2, the greater the user's influence, the more often the case information data is forwarded, and the numbers of comments and likes increase accordingly. This is because the more friends a user has, the more people browse the case information the user publishes, which increases the forwarding amount and the comment amount. In other words, the diffusion of social network case information is related to the influence of the user.

Model formalization. T-DZD models the diffusion of information through a directed network $G = (U, E)$, where $U$ is the set of all nodes and $E \subset U \times U$ is the set of all arcs. For each arc $(u_x, u_y) \in E$ there are two parameters: $f_{u_x,u_y}(t)$, which gives the probability that $u_x$ at time $t$ transmits information to $u_y$, where $0 < f_{u_x,u_y}(t) < 1$; and $r_{u_x,u_y}$, where $r_{u_x,u_y} > 0$. $f_{u_x,u_y}$ is referred to as the diffusion function and $r_{u_x,u_y}$ as the time delay parameter. $f_{u_x,u_y}$ is a function of the node, edge and exchanged content characteristics. As in the Independent Cascades (IC) model, the diffusion process starts from a given set $S$ of initially active nodes; unlike in the IC model, however, it unfolds in continuous time. Each node $u_x$ that is active at time $t$ has a chance to activate each of its inactive neighbors $u_y$, with probability $f_{u_x,u_y}(t)$. If the activation succeeds, the target node $u_y$ becomes active at time $t + r_{u_x,u_y}$. The process stops when no further activation can take place.
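
The cascade described above can be sketched as a small event-driven simulation. The function name `simulate_cascade`, the dictionary layout of the arcs, and the rule that a node reached by several neighbors keeps its earliest activation time are assumptions of this sketch rather than details given in the text.

```python
import heapq
import random


def simulate_cascade(arcs, seeds, rng=None, t0=0.0):
    """Simulate one continuous-time cascade.

    arcs  : {(u, v): (f, r)} where f(t) is the probability that u, active at
            time t, activates v, and r > 0 is the delay before v becomes active.
    seeds : initially active nodes, all activated at time t0.
    Returns {node: activation_time}.
    """
    rng = rng or random.Random()
    out = {}
    for (u, v), params in arcs.items():
        out.setdefault(u, []).append((v, params))

    active = {}
    queue = [(t0, s) for s in seeds]
    heapq.heapify(queue)
    while queue:                       # stops when no activation is pending
        t, u = heapq.heappop(queue)
        if u in active:
            continue                   # already activated at an earlier time
        active[u] = t
        for v, (f, r) in out.get(u, []):
            if v not in active and rng.random() < f(t):
                heapq.heappush(queue, (t + r, v))
    return active


# Hypothetical network: constant diffusion functions and fixed delays.
arcs = {
    ("a", "b"): (lambda t: 0.5, 2.0),
    ("b", "c"): (lambda t: 0.5, 1.5),
}
print(simulate_cascade(arcs, ["a"], rng=random.Random(0)))
```

A priority queue ordered by activation time keeps the simulation correct when delays differ between arcs, because nodes are always processed in chronological order.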

The input and output of T-DZD are shown in Fig. 4. Feature space. The model computes the probability that a node $u_x$ at time $t$ sends a message $i$ to a node $u_y$. This probability is a function of node, edge and topic features belonging to the social, topic and time dimensions. The 13 interpretable features described below are values between 0 and 1 calculated from past information diffusion traces.

Social dimension features: the rate at which each node issues messages (one value for $u_x$ and one for $u_y$); the Jaccard similarity coefficient $H(u_x, u_y)$ of the two sets of nodes with which $u_x$ and $u_y$ interact; the ratio of directed to undirected messages issued by each node (one value for $u_x$ and one for $u_y$); the rate at which each node receives targeted messages, $mR(u_x)$, $mR(u_y)$;

Topic dimension features: the interest of each user in the information (one value for $u_x$ and one for $u_y$);

Time dimension features: the distribution of each user's activity during a day, a non-parametric function stored as a vector (one for $u_x$ and one for $u_y$).

Model parameter estimation. The coefficients of the diffusion probability are estimated by Bayesian logistic regression on data describing how past information propagated in the network.
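
Two of the features lend themselves to a short sketch: the Jaccard similarity of two users' interaction sets, and the daily activity distribution stored as a vector. The function names, the 24-bin hourly resolution and the sample data are assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of node ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def activity_distribution(post_hours, bins=24):
    """Normalised histogram of the hours (0-23) at which a user was active."""
    counts = [0] * bins
    for hour in post_hours:
        counts[int(hour) % bins] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


# Hypothetical interaction sets and posting hours (illustrative data only).
contacts_x = {"b", "c", "d"}
contacts_y = {"c", "d", "e"}
print(jaccard(contacts_x, contacts_y))

dist = activity_distribution([0, 0, 13, 23])
print(dist[0], dist[13], dist[23])
```

Both functions return values between 0 and 1, which matches the stated range of the interpretable features.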

Illustratively, an efficient distributed learning algorithm is designed for TAP; it is implemented and tested under the Map-Reduce framework, and an event feature detection algorithm is adopted so that the method scales to real large-scale networks.

Feature analysis is performed on the diffusion graph of a diffusion event. The inputs of the algorithm are the diffusion graph and the feature coefficient of the diffusion event; the outputs are the diffusion feature of the event and its APL value. The main idea of the feature detection algorithm is as follows:

1) Set the algorithm parameter, i.e. the feature coefficient.

2) Count the degree of each node in the diffusion graph according to the adjacency list structure of the diffusion graph.

3) Count the numbers of multi-branch nodes and two-branch nodes in the graph: a node whose degree is greater than 2 is a multi-branch node, and any other node is a two-branch node.

4) Calculate the ratio of star nodes, and classify the diffusion feature of the event by comparing the ratio with the feature coefficient.

5) Calculate the APL value of each connected branch of the diffusion graph.

The running time of the algorithm is dominated by the calculation of the APL values; its time complexity is expressed in terms of $m$, the number of connected branches of the graph, and $n$, the number of nodes in the connected branch that contains the most nodes.
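
The steps above can be sketched in Python as follows. Several details are assumptions of this sketch: the adjacency list is treated as undirected and symmetric, "star nodes" are taken to be the multi-branch nodes, the labels "star-like" and "chain-like" are invented names for the two classes, and the APL of a branch is computed as the mean breadth-first distance over all ordered node pairs in that branch.

```python
from collections import deque


def components(nbr_map):
    """Connected branches of the graph, each as a list of nodes."""
    seen, comps = set(), []
    for start in nbr_map:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in nbr_map[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps


def apl(nbr_map, comp):
    """Average shortest-path length within one connected branch."""
    total, pairs = 0, 0
    for src in comp:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in nbr_map[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def detect_features(nbr_map, coeff):
    """Steps 2-5: degrees, multi-branch ratio, classification, APL per branch."""
    multi = sum(1 for nbrs in nbr_map.values() if len(nbrs) > 2)
    ratio = multi / len(nbr_map) if nbr_map else 0.0
    label = "star-like" if ratio >= coeff else "chain-like"
    apls = [apl(nbr_map, c) for c in components(nbr_map)]
    return label, ratio, apls


# Hypothetical diffusion graph: a star around node 0 plus a separate pair 4-5.
graph = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: {5}, 5: {4}}
label, ratio, apls = detect_features(graph, coeff=0.1)
print(label, round(ratio, 3), apls)
```

Running one breadth-first search per node keeps the sketch short; on large graphs the per-branch computations are independent, so they can be distributed.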

The event diffusion detection uses a Distributed Diffusion Detection (DDD) algorithm, based on the MapReduce programming model, to detect the event diffusion process. The diffusion graphs of each event type are assigned to one partition so that multiple reduce tasks can execute in parallel. The execution logic of the DDD algorithm is shown in Fig. 4 and proceeds as follows.

Since a social network may contain millions of users, and hundreds of millions of social ties between users, it is impractical to use a single machine to learn TFGs from such voluminous data. To address this challenge, we deploy learning tasks on distributed systems under the map-reduce programming model.

Map-Reduce is a programming model for distributed processing of large data sets. In the Map phase, each machine (referred to as a process node) receives a subset of the data as input and generates a set of intermediate key/value pairs. In the Reduce phase, each process node merges all intermediate values associated with the same intermediate key and outputs the final calculation result. The user specifies a mapping function that processes the key/value pairs to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
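
The programming model can be illustrated with a single-process Python sketch that mimics the map, shuffle and reduce stages in memory. It does not use Hadoop or any MapReduce library; the helper names, the record layout `(event_id, forwarding_user)` and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user-supplied mapper to every record and collect its pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group intermediate values by their intermediate key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Merge all values that share a key into one result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def mapper(record):
    """Emit (event_id, 1) for each forwarding record."""
    event_id, _user = record
    yield event_id, 1


def reducer(_key, values):
    """Sum the counts emitted for one event."""
    return sum(values)


# Hypothetical forwarding log (illustrative data only).
records = [("e1", "u1"), ("e2", "u3"), ("e1", "u2"), ("e1", "u4")]
counts = reduce_phase(shuffle(map_phase(records, mapper)), reducer)
print(counts)
```

In a real cluster the shuffle stage is what routes every value for a given key to the same reduce task, which is why keying by event type lets the reduce tasks run in parallel.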

In the affinity propagation process, we first divide the large social network graph into subgraphs and then assign each subgraph to a process node. Each subgraph has two types of nodes: internal nodes and edge nodes. An internal node is a node all of whose neighbors are in the subgraph; an edge node has neighbors in other subgraphs. For each subgraph $G$, all internal nodes and the edges between them constitute a closed graph $\bar{G}$. The edge nodes may be regarded as "supporting information" for the update rules. For ease of illustration, we consider a distributed learning algorithm for a single topic, so the Map phase and the Reduce phase can be defined as follows.
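
The distinction between internal and edge nodes can be sketched as follows. How the graph is partitioned is not specified in the text, so the partition is supplied as an input; the function names and the sample path graph are assumptions made for the example.

```python
def classify_nodes(nbr_map, part):
    """Split the nodes of each subgraph into internal and edge nodes.

    nbr_map : node -> set of neighbor nodes
    part    : node -> id of the subgraph the node was assigned to
    Returns {subgraph_id: {"internal": set, "edge": set}}.
    """
    result = {}
    for u, nbrs in nbr_map.items():
        same_side = all(part[v] == part[u] for v in nbrs)
        kind = "internal" if same_side else "edge"
        bucket = result.setdefault(part[u], {"internal": set(), "edge": set()})
        bucket[kind].add(u)
    return result


def closed_graph(nbr_map, internal):
    """Internal nodes together with only the edges that run between them."""
    return {u: nbr_map[u] & internal for u in internal}


# Hypothetical path 1-2-3-4-5-6 cut between nodes 3 and 4.
nbr_map = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
part = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

groups = classify_nodes(nbr_map, part)
print(groups[0])
print(closed_graph(nbr_map, groups[0]["internal"]))
```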

In the Map phase, each process node scans the closed graph $\bar{G}$ of its assigned subgraph $G$. Note that each edge $e_{ij}$ has two values, $a_{ij}$ and $r_{ij}$. The map function is therefore defined as follows: for each key/value pair $e_{ij}/a_{ij}$ it emits an intermediate key/value pair $e_{i*}/(b_{ij} + a_{ij})$, and for each key/value pair $e_{ij}/r_{ij}$ it emits an intermediate key/value pair $e_{*j}/r_{ij}$.

In the Reduce phase, each process node collects all values associated with an intermediate key $e_{i*}$ to generate the new $r_{i*}$ according to the corresponding update equation, and all intermediate values associated with the same key $e_{*j}$ to generate the new $a_{*j}$ according to the corresponding update equation. Thus, one map-reduce pass corresponds to one iteration of our affinity propagation algorithm.

The event diffusion detection algorithm was run on a Hadoop cluster using single-file data sets of different scales. On each data set the algorithm was executed 10 times, and the average of the best 3 runs was taken as the final execution time, as shown in Table 3:

TABLE 3 DDD Algorithm runtime comparison

According to the test results in Table 3, the execution time of the algorithm increases noticeably with the data scale; when the data set reaches 1 GB, the execution time is still less than 300 s. The experiments show that the method has a clear advantage in execution time when processing social network data.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.

As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).

It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims

1. A case popularity diffusion processing method based on social network case information is characterized in that:

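constructing a diffusion monitoring model, wherein the diffusion of the social network case information is modeled through a directed network $G = (U, E)$, wherein $U$ is the set of all nodes and $E \subset U \times U$ is the set of all arcs; for each arc $(u_x, u_y) \in E$ there are two parameters: $f_{u_x,u_y}(t)$, which gives the probability that $u_x$ at time $t$ transmits information to $u_y$, wherein $0 < f_{u_x,u_y}(t) < 1$; and $r_{u_x,u_y}$, wherein $r_{u_x,u_y} > 0$; $f_{u_x,u_y}$ is referred to as the diffusion function and $r_{u_x,u_y}$ as the time delay parameter; $f_{u_x,u_y}$ is a function of the node, edge and exchanged content characteristics;

computing the probability that node $u_x$ at time $t$ sends a message $i$ to node $u_y$;

the probability is a function of node, edge and topic features belonging to the social, topic and time dimensions, wherein the social dimension features are: the rate at which each node issues messages; the Jaccard similarity coefficient $H(u_x, u_y)$ of the two sets of nodes with which $u_x$ and $u_y$ interact; the ratio of directed to undirected messages issued by each node; and the rate at which each node receives targeted messages, $mR(u_x)$, $mR(u_y)$;

the topic dimension features are: the interest of each user in the information;

the time dimension features are: the distribution of each user's activity during a day, a non-parametric function stored as a vector;

the coefficients of the diffusion probability are obtained by Bayesian logistic regression on data describing how past information propagated in the network.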

2. The method of claim 1, wherein: the method further comprises performing feature detection on the diffusion graph of a diffusion event; the inputs of the feature detection are the diffusion graph and the feature coefficient of the diffusion event; the outputs are the diffusion feature of the event and its APL value.

3. The method of claim 2, wherein: the feature detection specifically includes:

1) Setting the feature coefficient;

5) Calculating the APL value of each connected branch of the diffusion graph;

4. The method of claim 3, wherein: the feature detection is completed in a distributed detection mode.

5. The method of claim 4, wherein: in the feature detection, the diffusion graphs of each event type are assigned to one partition, so that a plurality of reduce tasks are executed in parallel.

6. The method of claim 5, wherein: constructing the diffusion monitoring model further comprises dividing the large social network graph into subgraphs and then distributing each subgraph to a process node; each subgraph has two types of nodes: internal nodes and edge nodes; an internal node is a node all of whose neighbors are in the subgraph; an edge node has neighbors in other subgraphs; for each subgraph $G$, all internal nodes and the edges between them constitute a closed graph $\bar{G}$.

7. The method of claim 6, wherein: the social networking case information data includes: forwarding amount, comment amount, user node degree and activity degree.

8. The method of claim 6, wherein: the diffusion of case information is detected using association rules over the feature information of the social network case information data.

9. The method of claim 8, wherein: the diffusion processes of diffusion events with the same feature are compared using the average path length (APL) from graph theory, and the nodes and edges of the event diffusion graph are stored in an adjacency list.

10. The method of any one of claims 1-9, wherein the method is applied to the information processing of public interest litigation cases.