CN117891857B - Data mining method and system based on big data - Google Patents


Info

Publication number
CN117891857B
Authority
CN
China
Prior art keywords
data
association
time
nodes
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410284239.XA
Other languages
Chinese (zh)
Other versions
CN117891857A (en)
Inventor
洪永霖
冯逸华
许丽清
宋玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202410284239.XA
Publication of CN117891857A
Application granted
Publication of CN117891857B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of information retrieval, and in particular to a data mining method and system based on big data. The method comprises: analyzing the timestamps of an original data set and segmenting the data, each segment representing a data snapshot within a target time interval; and identifying key features and fluctuation patterns across multiple time periods to generate a time segment feature set. By combining timestamp analysis with a sliding window technique, the invention extracts the key features and fluctuation patterns of the data across multiple time periods and constructs a dynamic association rule set, which enhances the dynamic tracking capability of data mining and the accuracy of association analysis. A decision logic that compares the fitting degree against a preset threshold determines whether a potential connection exists, so that new associations between articles, or the likelihood that existing associations will strengthen, can be predicted; refined community structure analysis and node fitting degree calculation improve the relevance and accuracy of information retrieval.

Description

Data mining method and system based on big data
Technical Field
The invention relates to the technical field of information retrieval, in particular to a data mining method and system based on big data.
Background
A data mining method based on big data belongs to the technical field of information retrieval, a field that uses computer science and statistical methods to mine valuable information from large data sets. The primary goal of information retrieval technology is to organize, index, and search data so as to provide quick and accurate information searches. The field covers not only the processing of text data but also the searching and analysis of image, video and audio data. With the development of the internet and digital storage technology, information retrieval has become one of the key technologies of the big data age, supporting search engines, recommendation systems, data analysis tools and other applications.
The data mining method based on big data refers to a process of extracting information and patterns from a huge, complex data set by using a data mining technology. The aim is to mine useful information by analyzing and understanding large amounts of data, support decision making, predict future trends, or identify potential associations in the data. The method aims at extracting knowledge from the original data, helping individuals and organizations better understand the meaning behind the data, and further achieving the effects of optimizing business processes, improving efficiency, enhancing user experience and the like.
Traditional data mining methods rely on static data processing and simple association rule analysis, and therefore lack dynamic tracking and fine-grained analysis capabilities when faced with large-scale, complex data sets. They cannot make full use of time-series data and complex community structure information, so the accuracy and predictive power of their association analysis results are insufficient and they cannot respond effectively to rapid changes in the data environment. These shortcomings impair the effectiveness of information retrieval, limit the application of data mining techniques in decision support and potential association identification, and reduce the quality and timeliness of decision making.
Disclosure of Invention
The invention aims to solve the defects in the prior art, and provides a data mining method and system based on big data.
In order to achieve the above purpose, the present invention adopts the following technical scheme: the data mining method based on big data comprises the following steps:
S1: analyzing the timestamps of the original data set and segmenting the data, wherein each segment represents a data snapshot within a target time interval, identifying key features and fluctuation patterns across multiple time periods, and generating a time segment feature set;
S2: applying a sliding window technique to the time segment feature set, adjusting the window size and step length, analyzing the data changes in each window, identifying the change relationships between articles, mining how those relationships evolve over time, and constructing a dynamic association rule set;
S3: converting the dynamic association rule set into a set of nodes and edges, wherein each article is a node, each association rule is an edge, and the weight of the edge between nodes represents the association strength, and constructing an article relation graph reflecting the interaction between articles;
S4: applying community discovery logic to the article relation graph, grouping similar or related nodes according to the connection density and pattern of the nodes, revealing the inherent relevance among different article groups, and obtaining an article community structure;
S5: evaluating potential future connections using the node and edge information in the article community structure, analyzing the probability that articles inside and outside the communities will form new associations or strengthen existing ones, and obtaining a predicted association strength result;
S6: combining the data in the predicted association strength result with the dynamic association rule set, analyzing the development trend of article relationships over time, evaluating the stability and rate of change of the relationships, and establishing a relationship dynamic trend analysis.
As a further scheme of the invention, the time segmentation feature set comprises time segment division, activity frequency and key fluctuation mode, the dynamic association rule set is specifically item combination change, purchasing frequency trend and target combination occurrence frequency, the item relation graph comprises node definition, edge weight and node aggregation degree, the item community structure comprises item grouping set, internal connection density and community independence, the predicted association strength result comprises new item combination potential, relationship strengthening possibility and community prediction trend, and the relationship dynamic trend analysis is specifically stability analysis, change speed and future trend prediction.
As a further aspect of the present invention, analyzing a time stamp of an original data set, segmenting the data, each segment representing a snapshot of the data within a target time interval, identifying key features and fluctuation patterns within multiple time periods, and generating a time segment feature set specifically includes:
S101: screening time stamp information in an original data set, sorting according to a time sequence, setting time interval parameters, dividing the data set into a plurality of continuous time periods, wherein each time period comprises data records in the time period, and acquiring a time period data set;
S102: traversing the time period data set, calculating key indexes in each time period, including average value, maximum value and minimum value, calculating fluctuation indexes of data, including standard deviation, and identifying data distribution characteristics and fluctuation modes in each time period to generate time period characteristic indexes;
S103: based on the time period characteristic indexes, characteristic changes among multiple time periods are analyzed, a continuous change trend or a periodic fluctuation mode is identified, and a time segment characteristic set reflecting data changes in the multiple time periods is constructed by combining key events or target date information in the time periods.
As a further scheme of the invention, a sliding window technology is applied to the time segment feature set, the window size and the step length are adjusted, the data change in each window is analyzed, the change relation among articles is identified, the evolution of the change relation along with time is mined, and the step of constructing a dynamic association rule set is specifically as follows:
S201: selecting the time segment feature set as initial data, determining the initial sliding window size and step length, and applying sliding window division to the data set so that each window holds continuous time-series data, generating a window sequence set;
S202: analyzing the data of each window in the window sequence set, calculating statistical indexes for each window, including the mean and variance of the data, and identifying and marking key data changes and features within the window, generating a window feature overview;
S203: using the window feature overview, comparing feature differences between different windows, identifying the patterns and relationships of data change, and constructing a variation relation map;
S204: analyzing, according to the variation relation map, the dynamic associations between articles as they change over time, and constructing a dynamic association rule set reflecting the evolution over time by integrating and optimizing the dynamic association information.
As a further scheme of the present invention, the dynamic association rule set is converted into a set of nodes and edges, each article is used as a node, the association rule is used as an edge, the weight of the edge between the nodes is used to represent the association strength, and the step of constructing the article relationship graph reflecting the interaction between the articles is specifically as follows:
S301: based on the dynamic association rule set, identifying association rules among multiple articles, defining each article as an independent node, converting each association rule into an edge connected with the corresponding article, and jointly deciding the preliminary weight of the edge by the support degree and the confidence degree of the association rule to generate an article node set and a preliminary association edge set;
S302: based on the article node set and the preliminary association edge set, recalculating the weight of each edge and adjusting the preliminary weight with a proportionality coefficient so as to reflect the association strength and directionality between articles, obtaining an optimized association edge set;
S303: and constructing an interaction diagram among the whole articles by using the optimized association edge set, so that each node represents one article, each edge represents the association between two articles, the weight of the edge represents the association strength, and an article relation diagram is built.
As a further scheme of the invention, community discovery logic is applied to the article relation diagram, the nodes of the same kind or association are grouped according to the connection density and mode of the nodes, the inherent association among differentiated article groups is revealed, and the steps for obtaining the article community structure are specifically as follows:
S401: collecting the node and connection information in the article relation graph and, using a graph representation, drawing each article as a node and the relationships between articles as edges to construct an article network graph;
S402: calculating the connection density between nodes based on the article network graph, marking regions of high connection density as potential communities, and gradually adding boundary nodes until no further node can be added, forming a preliminary community division;
S403: optimizing the preliminary community division by adjusting the connection density threshold and the community size criterion, and improving the internal consistency and external distinctness of the communities through merge or split operations, obtaining the article community structure.
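A minimal sketch of the density-based growth described in S401 and S402, written in plain Python. The toy graph, the `MIN_DENSITY` threshold and the seeding rule (start from the highest-degree unassigned node) are assumptions made for illustration; the merge and split refinement of S403 is not shown.

```python
from itertools import combinations

# Hypothetical undirected article graph as an adjacency dict (assumed data).
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
MIN_DENSITY = 0.8  # assumed connection-density threshold


def density(members):
    """Fraction of node pairs inside `members` that are directly linked."""
    if len(members) < 2:
        return 1.0
    pairs = list(combinations(sorted(members), 2))
    linked = sum(1 for u, v in pairs if v in graph[u])
    return linked / len(pairs)


def grow_communities():
    unassigned, communities = set(graph), []
    while unassigned:
        # Seed with the unassigned node of highest degree (ties broken by name).
        seed = max(sorted(unassigned), key=lambda n: len(graph[n]))
        community = {seed}
        grew = True
        while grew:
            grew = False
            neighbours = {n for m in community for n in graph[m]}
            frontier = sorted(neighbours & (unassigned - community))
            for cand in frontier:
                # Admit a boundary node only if the group stays dense enough.
                if density(community | {cand}) >= MIN_DENSITY:
                    community.add(cand)
                    grew = True
        unassigned -= community
        communities.append(community)
    return communities


communities = grow_communities()
```

On this toy graph the two triangles are kept apart because adding node D to the first triangle would drop its density below the assumed threshold.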
As a further scheme of the invention, the step of evaluating potential future connections using the node and edge information in the article community structure, analyzing the probability that articles inside and outside the communities will form new associations or strengthen existing ones, and obtaining the predicted association strength result is specifically as follows:
S501: using the node and edge information in the article community structure, calculating the number of common neighbors between each pair of nodes by a network analysis method, evaluating the fitting degree between the nodes based on the number of common neighbors, and regarding two nodes as having a potential connection if their fitting degree exceeds a preset threshold, generating a node fitting degree matrix;
The network analysis method is as follows:
[Formula not reproduced in the source text.]
The number of weighted common neighbors between each pair of nodes is calculated and the node fitting degree matrix is generated, where N(i) and N(j) denote the neighbor sets of node i and node j; four weight coefficients appear in the formula (their symbols are missing from the source text); Si and Sj are the connection strengths of the nodes; Ifc is the information flow coefficient; Savg is the average network connection strength; and Cavg is the average aggregation coefficient;
S502: screening, based on the node fitting degree matrix, the node pairs whose fitting degree is higher than the target threshold, and performing community membership analysis on them: if the two nodes belong to the same community, their potential connection probability is increased, and if they do not, it is reduced; and constructing a potential connection probability table;
S503: and scoring and sequencing the potential connection strength of the node pairs according to the potential connection probability table, determining the potential connection strength of each pair of nodes, obtaining a sequenced node pair list, and generating a predicted association strength result.
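A minimal sketch of the S501 to S503 flow in plain Python. Because the weighted formula is not available in the text, the fitting degree here is simply the unweighted common-neighbor count, which is a stand-in and not the patented formula. The toy graph, the community assignment, `FIT_THRESHOLD`, `SAME_BOOST` and `CROSS_DAMP` are all assumed values, and only pairs that are not yet connected are scored.

```python
from itertools import combinations

# Hypothetical graph and community assignment (assumed data).
graph = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
    "D": {"B", "C", "E"}, "E": {"D"},
}
community_of = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 1}

FIT_THRESHOLD = 0                  # assumed: fitting degree must exceed this
SAME_BOOST, CROSS_DAMP = 1.5, 0.5  # assumed same/cross community factors


def predict_links():
    scored = []
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already connected; this sketch scores only new links
        fit = len(graph[u] & graph[v])  # S501: plain common-neighbor count
        if fit <= FIT_THRESHOLD:
            continue
        # S502: raise the score inside a community, lower it across communities.
        same = community_of[u] == community_of[v]
        scored.append((u, v, fit * (SAME_BOOST if same else CROSS_DAMP)))
    # S503: rank pairs by score, highest first (ties broken by node names).
    return sorted(scored, key=lambda t: (-t[2], t[0], t[1]))


ranking = predict_links()
```

The resulting values are relative scores rather than calibrated probabilities; a real implementation would need the weighted formula and a normalization step.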
As a further scheme of the invention, the data in the predicted association strength result and the dynamic association rule set are combined, the development trend of the object relationship along with time is analyzed, the stability and the change speed of the relationship are evaluated, and the step of establishing the relationship dynamic trend analysis is specifically as follows:
S601: extracting from the predicted association strength result the association strength value of each pair of articles at different time points, and grouping the values by time label, each group representing the state of the article relationships in a target time period, thereby constructing a time-series association matrix;
S602: for the time-series association matrix, calculating the trend of article relationship strength in each time period, including the average, highest and lowest association strength of each period and their rates of change, obtaining an association strength change index for each time period;
S603: analyzing the development trend of article relationships over multiple time periods using the association strength change indexes, and predicting the future stability and rate of change of the article relationships in combination with historical change patterns, establishing the relationship dynamic trend analysis.
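A minimal sketch of the per-period indexes described in S601 and S602, in plain Python. The period labels, article pairs and strength values are made up, and the rate of change is taken as the relative change of the period mean, which is one possible reading of the text.

```python
# Hypothetical association strengths per period for each article pair.
strength_by_period = {
    "2024-01": {("milk", "eggs"): 0.50, ("milk", "bread"): 0.30},
    "2024-02": {("milk", "eggs"): 0.60, ("milk", "bread"): 0.20},
    "2024-03": {("milk", "eggs"): 0.75, ("milk", "bread"): 0.25},
}


def period_indices(data):
    """Mean, max, min and relative change of the mean for each period."""
    rows, prev_mean = [], None
    for period in sorted(data):
        values = list(data[period].values())
        mean = sum(values) / len(values)
        # Relative change versus the previous period (None for the first one).
        rate = None if prev_mean is None else (mean - prev_mean) / prev_mean
        rows.append({
            "period": period,
            "mean": mean,
            "max": max(values),
            "min": min(values),
            "mean_change_rate": rate,
        })
        prev_mean = mean
    return rows


indices = period_indices(strength_by_period)
```

A small, steady `mean_change_rate` across periods would suggest a stable relationship, while large swings would suggest a fast-changing one; how S603 turns these indexes into a forecast is not specified here.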
The system comprises a time sequence analysis module, a feature mining module, a rule dynamic updating module, a relation network mapping module, a community connection identification module, a correlation strength prediction module and a trend prediction and analysis module;
The time sequence analysis module extracts time stamps from the original data, sorts the time stamps according to time sequence, sets time intervals, divides the data into continuous time periods, calculates basic statistics of the data in multiple time periods, including total number, average value and extremum, and generates a time sequence data snapshot;
The feature mining module analyzes the time sequence data snapshot, identifies data fluctuation and key indexes in each time period, extracts feature changes and key fluctuation modes by combining the changes of the data in multiple time periods, and acquires time period feature analysis;
The rule dynamic updating module applies a sliding window dividing technology to analyze the characteristics of the time period, gradually adjusts the window size and the step length, analyzes the change and the mode of data in the continuous time period, extracts the change relation and the trend, and generates a dynamic association model;
The relation network mapping module maps a plurality of articles and variation relations thereof into a form of a graph by using a dynamic association model, the articles are used as nodes, the variation relations are converted into edges, and the weights of the edges are adjusted to reflect the association strength, so that an article relation network is constructed;
The community connection identification module analyzes the article relation network, identifies groups of connected nodes using the structural characteristics of the network, groups them according to connection density, and defines the groups as distinct communities to obtain a community structure diagram;
The association strength prediction module evaluates, based on the community structure diagram, the potential connection probabilities between nodes in multiple communities, predicts new or strengthened associations with reference to the common attributes and historical interactions of the nodes, and generates an association prediction matrix;
The trend prediction and analysis module analyzes the data in the association prediction matrix, tracks the evolution of article relationships across different time points, identifies trends in relationship change, including stability and development speed, and establishes a trend analysis framework.
Compared with the prior art, the invention has the advantages and positive effects that:
By combining timestamp analysis with a sliding window technique, the invention extracts the key features and fluctuation patterns of the data across multiple time periods and constructs a dynamic association rule set, which enhances the dynamic tracking capability of data mining and the accuracy of association analysis. A decision logic that compares the fitting degree against a preset threshold determines whether a potential connection exists, so that new associations between articles, or the likelihood that existing associations will strengthen, can be predicted; refined community structure analysis and node fitting degree calculation improve the relevance and accuracy of information retrieval.
Drawings
FIG. 1 is a schematic workflow diagram of the present invention;
FIG. 2 is a S1 refinement flowchart of the present invention;
FIG. 3 is a S2 refinement flowchart of the present invention;
FIG. 4 is a S3 refinement flowchart of the present invention;
FIG. 5 is a S4 refinement flowchart of the present invention;
FIG. 6 is a S5 refinement flowchart of the present invention;
FIG. 7 is a S6 refinement flowchart of the present invention;
Fig. 8 is a system flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In the description of the present invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, in the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Embodiment one: referring to fig. 1, the present invention provides a technical solution: the data mining method based on big data comprises the following steps:
S1: analyzing the timestamps of the original data set and segmenting the data, wherein each segment represents a data snapshot within a target time interval, identifying key features and fluctuation patterns across multiple time periods, and generating a time segment feature set;
S2: applying a sliding window technique to the time segment feature set, adjusting the window size and step length, analyzing the data changes in each window, identifying the change relationships between articles, mining how those relationships evolve over time, and constructing a dynamic association rule set;
S3: converting the dynamic association rule set into a set of nodes and edges, wherein each article is a node, each association rule is an edge, and the weight of the edge between nodes represents the association strength, and constructing an article relation graph reflecting the interaction between articles;
S4: applying community discovery logic to the article relation graph, grouping similar or related nodes according to the connection density and pattern of the nodes, revealing the inherent relevance among different article groups, and obtaining an article community structure;
S5: evaluating potential future connections using the node and edge information in the article community structure, analyzing the probability that articles inside and outside the communities will form new associations or strengthen existing ones, and obtaining a predicted association strength result;
S6: combining the data in the predicted association strength result with the dynamic association rule set, analyzing the development trend of article relationships over time, evaluating the stability and rate of change of the relationships, and establishing a relationship dynamic trend analysis.
The time segment feature set comprises time segment division, activity frequency and key fluctuation mode, the dynamic association rule set comprises article combination change, purchasing frequency trend and target combination occurrence frequency, the article relation graph comprises node definition, edge weight and node concentration, the article community structure comprises article grouping set, internal connection density and community independence, the predicted association strength result comprises new article combination potential, relationship strengthening possibility and community prediction trend, and the relationship dynamic trend analysis comprises stability analysis, change speed and future trend prediction.
Referring to fig. 2, the time stamp of the original data set is analyzed, the data is segmented, each segment represents a data snapshot in a target time interval, key features and fluctuation modes in multiple time periods are identified, and the step of generating the time segment feature set specifically includes:
S101: screening time stamp information in an original data set, sorting according to a time sequence, setting time interval parameters, dividing the data set into a plurality of continuous time periods, wherein each time period comprises data records in the time period, and acquiring a time period data set;
Based on the original data set, the processing is programmed in Python: the data are imported with the read_csv function of the Pandas library, string timestamps are converted to datetime objects with the to_datetime function and then sorted, and the data set is segmented by time by setting the interval parameter of the resample method to 'H' (hourly), so that each time period contains the data records falling within it, producing the time period data set.
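A minimal sketch of S101 with pandas, assuming a CSV with columns named `timestamp`, `item` and `qty` (the column names and sample rows are hypothetical). The hourly interval is passed as a `pd.Timedelta` rather than an alias string, because the spelling of the hourly alias differs between pandas versions.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the original data set.
raw_csv = io.StringIO(
    "timestamp,item,qty\n"
    "2024-03-01 09:40:00,B,1\n"
    "2024-03-01 08:05:00,A,2\n"
    "2024-03-01 08:50:00,B,4\n"
    "2024-03-01 10:15:00,A,3\n"
)

df = pd.read_csv(raw_csv)
df["timestamp"] = pd.to_datetime(df["timestamp"])  # string -> datetime
df = df.sort_values("timestamp").set_index("timestamp")

# One-hour bins, equivalent to an hourly frequency alias.
hourly = df.resample(pd.Timedelta(hours=1))

# Each (bin start, sub-frame) pair is one time period data set.
periods = {start: frame for start, frame in hourly}
period_totals = hourly["qty"].sum()
```

Each entry of `periods` holds the records of one hour, and `period_totals` gives a quick per-period aggregate for checking the split.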
S102: traversing the time period data set, calculating key indexes in each time period, including average value, maximum value and minimum value, calculating fluctuation indexes of data, including standard deviation, and identifying data distribution characteristics and fluctuation modes in each time period to generate time period characteristic indexes;
Traversing the time period data set, a statistical method is applied: the mean, maximum and minimum of each time period are calculated with the NumPy library, and the standard deviation is calculated with the std function, so as to identify the data distribution characteristics and fluctuation patterns of each time period and generate the time period characteristic indexes.
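A minimal sketch of the S102 indexes with NumPy. The period labels and values are made up, and `np.std` is used with its default population form (`ddof=0`); the text does not say which form is intended.

```python
import numpy as np

# Hypothetical per-period values (for example hourly purchase quantities).
period_values = {
    "08:00": [2.0, 4.0, 6.0],
    "09:00": [5.0, 5.0, 5.0],
}


def period_indicators(values):
    """Key indexes (mean, max, min) and fluctuation index (std) of one period."""
    arr = np.asarray(values, dtype=float)
    return {
        "mean": float(np.mean(arr)),
        "max": float(np.max(arr)),
        "min": float(np.min(arr)),
        "std": float(np.std(arr)),  # population standard deviation (ddof=0)
    }


indicators = {label: period_indicators(v) for label, v in period_values.items()}
```

A period with zero standard deviation, such as the second one here, has no fluctuation at all, while a larger value flags a more volatile period.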
S103: based on the time period characteristic indexes, characteristic changes among multiple time periods are analyzed, a continuous change trend or a periodic fluctuation mode is identified, and a time segment characteristic set reflecting data changes in the multiple time periods is constructed by combining key events or target date information in the time periods.
Based on the time period characteristic index, a time sequence analysis technology is adopted, an ARIMA model (AutoRegressive Integrated Moving Average model) in a Statsmodels library is utilized, preset model parameters are (p=2, d=1, q=2) which respectively represent the order of an autoregressive term, the differential order and the order of a moving average term, time period data are fitted, a time sequence prediction is carried out through a fit method of the model, a continuous change trend or a periodic fluctuation mode is identified, and a time segment characteristic set reflecting data change in multiple time periods is generated by combining key events or target date information in the time periods.
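For reference, the parameters (p=2, d=1, q=2) correspond to the following textbook form of the model, written without a constant term. The symbols are notation introduced here and not taken from the source: B is the backshift operator (B y_t = y_{t-1}), y_t is the time period characteristic index at period t, the φ and θ terms are the autoregressive and moving-average coefficients, and ε_t is white noise.

```latex
(1 - \phi_1 B - \phi_2 B^2)\,(1 - B)\, y_t = (1 + \theta_1 B + \theta_2 B^2)\, \varepsilon_t
```

The factor (1 - B) is the single differencing step (d=1), the two φ terms are the second-order autoregressive part (p=2), and the two θ terms are the second-order moving-average part (q=2).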
Referring to fig. 3, a sliding window technique is applied to a time-segment feature set, the window size and step size are adjusted, the data change in each window is analyzed, the change relation between articles is identified, the evolution of the change relation with time is mined, and the steps of constructing a dynamic association rule set are specifically as follows:
S201: selecting a time segment feature set as initial data, determining the size and the step length of an initial sliding window, and applying sliding window division to a data set so that continuous time sequence data exists in each window to generate a window sequence set;
Based on the time segment feature set, sliding window division is applied to the data set in Python with the Pandas library: the window parameter of the rolling method is set to '7D', i.e. a window size of 7 days; the min_periods parameter is set to 1, so that a window can be produced from as little as one day of data; and the step parameter is set to '1D', i.e. the window slides one day at a time, generating the window sequence set of continuous time-series data.
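A minimal sketch of S201 with pandas, assuming a daily series of made-up values. It uses `rolling("7D", min_periods=1)` on a datetime index. No `step` argument is passed in this sketch: with one row per day, consecutive windows already advance by one day, and time-based windows in pandas may not accept a step value at all.

```python
import pandas as pd

# Hypothetical daily feature series (one value per day); the values are made up.
idx = pd.date_range("2024-03-01", periods=10, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], index=idx, dtype=float)

# 7-day window ending at each row; min_periods=1 lets the first rows
# form shorter windows instead of producing NaN.
roller = daily.rolling("7D", min_periods=1)
window_mean = roller.mean()

# Materialise the window sequence set as a list of sub-series.
windows = [w for w in roller]
```

The first window contains a single day, and from the seventh row onward every window spans a full seven days.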
S202: analyzing the data of each window in the window sequence set, calculating the statistical index of each window, including the average value and the variance of the data, identifying and marking the key data change and the characteristics in the window, and generating a window characteristic overview;
Analyzing the data of each window in the window sequence set, a statistical method is applied: the mean and variance of each window are calculated with the mean and var functions of the NumPy library, and the key data changes and features within each window are identified and marked, generating the window feature overview.
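A minimal sketch of S202 with NumPy. The text does not define what counts as a key data change, so this sketch assumes a simple rule: a window is flagged when its mean differs from the previous window's mean by more than a hypothetical `JUMP_THRESHOLD`.

```python
import numpy as np

# Hypothetical window sequence set: each inner list is one window of values.
windows = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [8.0, 9.0, 10.0]]
JUMP_THRESHOLD = 2.0  # assumed cut-off for a "key change" in the window mean

overview = []
prev_mean = None
for i, w in enumerate(windows):
    arr = np.asarray(w)
    mean = float(np.mean(arr))
    var = float(np.var(arr))  # population variance (ddof=0)
    # Flag the window if its mean jumped relative to the previous window.
    key_change = prev_mean is not None and abs(mean - prev_mean) > JUMP_THRESHOLD
    overview.append({"window": i, "mean": mean, "var": var, "key_change": key_change})
    prev_mean = mean
```

Here only the third window is flagged, because its mean rises by 6 while the variance inside each window stays the same.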
S203: using a window feature overview, comparing feature differences in the differential windows, identifying a mode and a relation of data change, and constructing a change relation map;
Using the window feature overview, a graph-theoretic analysis method is applied: the variation relation map is built with the NetworkX library, nodes and edges are added with the add_nodes_from and add_edges_from methods respectively, where nodes represent windows and edges represent feature differences between windows, and the Jaccard coefficient is used as the edge weight; the feature differences between different windows are compared and the patterns and relationships of data change are identified, constructing the variation relation map.
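A minimal sketch of the Jaccard-weighted map in S203. To stay dependency-free it uses a plain dictionary of edges instead of a NetworkX graph, and it assumes each window is summarised by the set of articles active in it; both choices are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical window features: the set of articles active in each window.
window_items = {
    "w0": {"milk", "bread"},
    "w1": {"milk", "bread", "eggs"},
    "w2": {"eggs", "tea"},
}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Edge dict standing in for a graph: edges[(u, v)] = Jaccard weight.
nodes = sorted(window_items)
edges = {}
for u, v in combinations(nodes, 2):
    weight = jaccard(window_items[u], window_items[v])
    if weight > 0:  # windows with nothing in common get no edge
        edges[(u, v)] = weight
```

With NetworkX, the same `nodes` list and `(u, v, {"weight": w})` tuples built from `edges` could be passed to `add_nodes_from` and `add_edges_from`.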
S204: according to the variation relation map, analyzing the dynamic association of the time variation among the articles, and constructing a dynamic association rule set reflecting the time evolution by integrating and optimizing the dynamic association information.
According to the variation relation map, a community discovery algorithm is applied: community division based on the Louvain method is implemented with the community library for Python, the dynamic associations between articles as they change over time are analyzed, and the dynamic association information is integrated and optimized to construct the dynamic association rule set reflecting the evolution over time.
Referring to fig. 4, converting a dynamic association rule set into a set of nodes and edges, wherein each item is used as a node, the association rule is used as an edge, the weight of the edge between the nodes is used to represent the association strength, and the step of constructing an item relation graph reflecting the interaction between the items is specifically as follows:
S301: based on a dynamic association rule set, identifying association rules among multiple articles, defining each article as an independent node, converting each association rule into an edge connected with the corresponding article, and jointly deciding the preliminary weight of the edge by the support and the confidence of the association rule to generate an article node set and a preliminary association edge set;
Based on the dynamic association rule set, a graph theory analysis method is adopted with the Python language and the NetworkX library: each article is defined as an independent node, each association rule is converted into an edge connecting the corresponding articles, and nodes and edges are added to the graph with the add_node method and the add_edge method respectively. The preliminary weight of each edge is jointly determined by the support and the confidence of the association rule according to the formula w = α × support + β × confidence, where α and β are weight adjustment coefficients, and the article node set and the preliminary association edge set are generated.
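The weight formula w = α × support + β × confidence can be illustrated with a short sketch. The transaction data, the function name rule_edge_weight, and the default coefficients of 0.5 are assumptions made for the example; the text does not fix concrete values for the coefficients.

```python
def rule_edge_weight(transactions, a, b, alpha=0.5, beta=0.5):
    """Preliminary weight w = alpha * support + beta * confidence
    for the association rule a -> b."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    # support(a -> b): share of all transactions containing both articles.
    support = count_ab / n if n else 0.0
    # confidence(a -> b): share of transactions with a that also contain b.
    confidence = count_ab / count_a if count_a else 0.0
    return alpha * support + beta * confidence


# Hypothetical transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
]
w_milk_bread = rule_edge_weight(transactions, "milk", "bread")
```

For this data the rule milk -> bread has support 2/4 and confidence 2/3, so the preliminary weight is 0.5 × 0.5 + 0.5 × 2/3.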
S302: based on the article node set and the preliminary association edge set, recalculating the weight of each edge, and adjusting the preliminary weight by a proportionality coefficient so as to reflect the association strength and directionality among articles, obtaining an optimized association edge set;
Based on the article node set and the preliminary association edge set, a weight adjustment strategy is adopted with the Python language and the NetworkX library: the weight of each edge is recalculated and the preliminary weight is adjusted by a proportionality coefficient γ to reflect the association strength and directionality among articles, according to the formula w' = γ × w, where w' is the adjusted weight, w is the preliminary weight, and γ is the proportionality coefficient that strengthens or weakens the original weight; the optimized association edge set is obtained.
S303: and constructing an interaction diagram among the whole articles by utilizing the optimized association edge set, so that each node represents one article, each edge represents the association between two articles, the weight of the edge represents the association strength, and an article relation diagram is established.
Using the optimized association edge set, a graph construction method is adopted with the Python language and the NetworkX library to construct the overall interaction graph among articles: nodes and edges are added with the add_node and add_edge methods, each node represents one article, each edge represents the association between two articles, and the adjusted edge weight represents the association strength between them, thereby establishing the article relationship graph.
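Steps S302 and S303 can be sketched together as scaling each preliminary weight by the proportionality coefficient and collecting the result into an adjacency map. The nested dictionary here stands in for a NetworkX graph, and the function name, the sample edges, and the value gamma = 1.5 are assumptions for illustration only.

```python
from collections import defaultdict


def build_item_graph(preliminary_edges, gamma=1.0):
    """Scale each preliminary weight by gamma and return an adjacency map.

    preliminary_edges: iterable of (antecedent, consequent, weight) triples.
    The map is directed (antecedent -> consequent) so rule direction is kept.
    """
    graph = defaultdict(dict)
    for src, dst, weight in preliminary_edges:
        graph[src][dst] = gamma * weight  # adjusted weight w' = gamma * w
        graph.setdefault(dst, {})         # every article appears as a node
    return dict(graph)


# Hypothetical preliminary association edges.
edges = [("milk", "bread", 0.6), ("bread", "eggs", 0.2)]
item_graph = build_item_graph(edges, gamma=1.5)
```

A coefficient above 1 strengthens every association and a coefficient below 1 weakens it; a per-edge coefficient would be a straightforward variation.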
Referring to fig. 5, the step of applying community discovery logic to the article relationship graph, grouping nodes of the same or related type according to the connection density and pattern of the nodes, revealing the inherent relevance among different article groups, and obtaining an article community structure is specifically as follows:
S401: collecting nodes and connection information in the article relation graph, drawing each article as a node by using a graphic representation method, and constructing an article network graph by using the interrelationship among the articles as edges;
The node and connection information in the article relationship graph is collected and a visual drawing method is adopted with the Python language and the Matplotlib library: each article is drawn as a node and the interrelationships between articles are drawn as edges, using the draw_networkx_nodes and draw_networkx_edges functions of the NetworkX library respectively, and the node positions are calculated automatically with the spring_layout algorithm to optimize the visual effect, thereby constructing the article network graph.
S402: calculating the connection density among nodes based on the article network graph, marking regions with high connection density as potential communities, and gradually adding edge nodes until no new node can be added, forming a preliminary community division;
Based on the article network graph, a graph theory analysis method is adopted with the Python language and the NetworkX library: the connection density of the whole graph is calculated with the density function, regions with high connection density are identified as potential communities, and edge nodes are gradually added with the greedy_modularity_communities function until no new node can be added, forming the preliminary community division.
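The connection density used in S402 can be sketched as below for a simple undirected graph, where density is 2m / (n(n - 1)) for n nodes and m edges. The helper induced_density, which measures how dense a candidate node group is, and the sample graph are assumptions added for illustration; the greedy community growth itself is not reproduced here.

```python
def graph_density(nodes, edges):
    """Density of a simple undirected graph: 2m / (n(n-1))."""
    n, m = len(nodes), len(edges)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def induced_density(group, edges):
    """Density of the subgraph induced by the node set `group`."""
    inner = [(u, v) for u, v in edges if u in group and v in group]
    return graph_density(group, inner)


# Hypothetical article network: two triangles joined by one bridge edge.
nodes = {"a", "b", "c", "d", "e", "f"}
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
overall = graph_density(nodes, edges)
triangle = induced_density({"a", "b", "c"}, edges)
```

The triangle has density 1.0 against 7/15 for the whole graph, which is the kind of contrast that marks a region as a potential community.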
S403: optimizing the preliminary community division, adjusting the connection density threshold and the community size standard, and improving the internal consistency and external distinctness of the communities through merging or splitting operations, obtaining the article community structure.
The preliminary community division is optimized with a community adjustment algorithm, using the Python language and the community library: the connection density threshold and the community size standard are adjusted by the Louvain method, the community modularity is calculated with its modularity function to improve the internal consistency and external distinctness of the communities, and merging or splitting operations are performed on the basis of the modularity_max functions, obtaining the article community structure.
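To make the role of modularity in S403 concrete, the sketch below computes the standard Newman modularity Q of a partition for a simple undirected, unweighted graph, so that a split and a merged partition can be compared. It is a plain-Python stand-in for the library functions named above, and the sample graph and function name are assumptions.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of a simple undirected graph.

    Q = sum over communities c of  L_c / m - (d_c / (2m))**2,
    where L_c counts edges inside c and d_c is the total degree of c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    member = {node: idx for idx, group in enumerate(communities) for node in group}
    inner = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[member[u]] += 1
        degree[member[v]] += 1
        if member[u] == member[v]:
            inner[member[u]] += 1
    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Hypothetical article network: two triangles joined by one bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
merged = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

Keeping the two triangles as separate communities scores higher than merging them, so a merge operation would be rejected for this graph.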
Referring to fig. 6, the step of evaluating potential future connections by using the node and edge information in the article community structure, estimating the probability that articles inside and outside a community form new associations or reinforce existing associations, and obtaining the predicted association strength result is specifically as follows:
S501: using the node and edge information in the article community structure, calculating the number of common neighbors between each pair of nodes by a network analysis method, evaluating the fitness between nodes based on the number of common neighbors, and, if the fitness exceeds a preset threshold, regarding the two nodes as having a potential connection, thereby generating a node fitness matrix;
The network analysis method is as follows:
[formula image not reproduced in this text]
The number of weighted common neighbors between each pair of nodes is calculated to generate the node fitness matrix, where N(i) and N(j) denote the neighbor sets of node i and node j, Si and Sj are the connection strengths of the nodes, ifc is the information flow coefficient, Savg is the average connection strength of the network, and Cavg is the average aggregation coefficient. Four further symbols defined at this point in the original were rendered as images and are not reproduced in this text.
The execution process comprises the steps of calculating the number of the common neighbors between each pair of nodes, considering the connection strength of each pair of nodes, the ratio of the information flow coefficient between the nodes to the average connection strength of the network and the average aggregation coefficient of the network, and improving the accuracy of predicting the potential connection through the weighted sum of the parameters.
Using the node and edge information in the article community structure, the network analysis method is applied with the Python language and the NetworkX library: the number of common neighbors between each pair of nodes is calculated with the common_neighbors function, and the fitness between nodes is evaluated based on that number; if the fitness exceeds a preset threshold, the two nodes are regarded as having a potential connection, where the preset threshold is determined by analyzing the average number of common neighbors in the network, and the node fitness matrix is generated.
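A simplified sketch of S501 follows. Because the weighted formula is not reproduced in this text, the example counts plain (unweighted) common neighbors and takes the network-average count as the preset threshold, as the paragraph above describes. The adjacency data and function names are assumptions.

```python
from itertools import combinations


def common_neighbor_matrix(adjacency):
    """Return {(u, v): number of common neighbours} for every unordered pair."""
    nodes = sorted(adjacency)
    return {
        (u, v): len(adjacency[u] & adjacency[v])
        for u, v in combinations(nodes, 2)
    }


def potential_links(matrix):
    """Pairs whose score exceeds the network-average score."""
    threshold = sum(matrix.values()) / len(matrix) if matrix else 0.0
    return [pair for pair, score in matrix.items() if score > threshold]


# Hypothetical article network as neighbour sets.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
fitness = common_neighbor_matrix(adjacency)
candidates = potential_links(fitness)
```

The pair (a, d) is not yet linked and would be a candidate new association, while (b, c) is already linked and would be a candidate for reinforcement.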
S502: based on the node fitness matrix, screening node pairs with the fitness higher than a target threshold, carrying out community attribution analysis on the node pairs, if two nodes belong to the same community, improving the potential connection probability of the two nodes, and if the two nodes do not belong to the same community, reducing the potential connection probability of the two nodes, and constructing a potential connection probability table;
Based on the node fitness matrix, a threshold screening method is adopted, node pairs with fitness higher than a target threshold are screened by using Python language and Pandas library, the target threshold is obtained by dynamic calculation according to the average fitness and standard deviation of a network, community attribution analysis is carried out on the node pairs, if two nodes belong to the same community, the potential connection probability of the two nodes is improved, and if the two nodes do not belong to the same community, the potential connection probability of the two nodes is reduced, and a potential connection probability table is constructed.
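The screening and community adjustment of S502 might look like the sketch below. Several details are assumptions not stated in the text: the threshold form mean + k × standard deviation, the normalization of fitness by its maximum to obtain a base probability, and the boost and damp factors of 1.2 and 0.8. Plain Python lists of dictionaries stand in for a Pandas table.

```python
from statistics import fmean, pstdev


def potential_connection_table(fitness, community_of, k=1.0, boost=1.2, damp=0.8):
    """Keep pairs above mean + k * std and adjust by community membership."""
    scores = list(fitness.values())
    if not scores:
        return []
    threshold = fmean(scores) + k * pstdev(scores)
    top = max(scores)
    table = []
    for (u, v), score in fitness.items():
        if score <= threshold:
            continue
        same = community_of[u] == community_of[v]
        # Same community raises the probability, different community lowers it.
        probability = min(1.0, (score / top) * (boost if same else damp))
        table.append({"pair": (u, v), "same_community": same,
                      "probability": probability})
    return table


# Hypothetical fitness scores and community labels.
fitness = {("a", "b"): 0.2, ("a", "c"): 0.9, ("b", "d"): 0.8, ("c", "d"): 0.1}
community_of = {"a": 0, "b": 0, "c": 0, "d": 1}
table = potential_connection_table(fitness, community_of, k=0.5)
```

Sorting the resulting rows by probability in descending order would give the ranked node pair list used in the next step.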
S503: and scoring and sequencing the potential connection strength of the node pairs according to the potential connection probability table, determining the potential connection strength of each pair of nodes, obtaining a sequenced node pair list, and generating a predicted association strength result.
According to the potential connection probability table, a scoring and ranking method is adopted with the Python language and the Pandas library: the potential connection strength of each pair of nodes is determined from its potential connection probability and a score for the pair's position in the community structure, a sorted node pair list is obtained, and the predicted association strength result is generated.
Referring to fig. 7, the step of combining the data in the predicted association strength result with the dynamic association rule set, analyzing the development trend of the article relationships over time, evaluating the stability and change speed of the relationships, and establishing the relationship dynamic trend analysis is specifically as follows:
S601: extracting the association strength value of each pair of articles at different time points from the predicted association strength result, and grouping the association strength values by time tag, where each group of data represents the article relationships in a target time period, thereby constructing a time sequence association matrix;
Based on the predicted association strength result, a time sequence analysis method is adopted, the data are processed by utilizing a pandas library of Python, association strength values of each pair of articles at different time points are extracted, the association strength values are grouped by using a DataFrame.
S602: for the time sequence association matrix in each time period, calculating the change trend of the article relationship strength, including the average, highest and lowest association strength of each time period and their change rate, to obtain an association strength change index for each time period;
For the time sequence association matrix in each time period, a statistical analysis method is adopted with the numpy library to calculate the change trend of the article relationship strength: the average association strength of each time period is calculated with numpy.mean(), the highest and lowest association strengths with numpy.max() and numpy.min(), and the change rate as (numpy.max() - numpy.min()) / numpy.mean(), generating the association strength change index for each time period.
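The per-period indexes can be sketched as follows, using built-in functions in place of numpy. The period labels, sample values, and function name are assumptions; the change rate follows the (max - min) / mean form given above.

```python
from statistics import fmean


def strength_change_index(period_values):
    """Per-period mean, max, min and (max - min) / mean of association strengths."""
    index = {}
    for period, values in period_values.items():
        mean = fmean(values)
        high, low = max(values), min(values)
        index[period] = {
            "mean": mean,
            "max": high,
            "min": low,
            # Guard against a zero mean to avoid division by zero.
            "change_rate": (high - low) / mean if mean else 0.0,
        }
    return index


# Hypothetical association strengths grouped by time tag.
period_values = {
    "2024-01": [0.25, 0.5, 0.75],
    "2024-02": [0.5, 0.5, 0.5],
}
change_index = strength_change_index(period_values)
```

A change rate of zero, as in the second period, indicates that all article pairs had the same association strength in that period.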
S603: using the association strength change indexes, analyzing the development trend of the article relationships over a plurality of time periods, and predicting the future stability and change speed of the article relationships in combination with the historical change pattern, establishing the relationship dynamic trend analysis.
Using the association strength change indexes, a linear regression model is constructed with the scikit-learn library to analyze the development trend of the article relationships over a plurality of time periods; in combination with the historical change pattern, the model parameters are set to their default values and the model is trained with fit(X, y), where X is the time sequence and y is the association strength change index; the future stability and change speed of the article relationships are predicted, and the relationship dynamic trend analysis is generated.
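For a single time feature, a linear regression with an intercept reduces to the closed-form least-squares line, which the sketch below computes without scikit-learn. The sample index values and the function name fit_line are assumptions; the slope can be read as the change speed, and the fit needs at least two distinct time points.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept (closed form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # requires at least two distinct x values
    return slope, mean_y - slope * mean_x


# Hypothetical change index observed over four consecutive periods.
periods = [0, 1, 2, 3]
index_values = [0.5, 0.75, 1.0, 1.25]
slope, intercept = fit_line(periods, index_values)
next_value = slope * 4 + intercept  # extrapolation to the next period
```

A slope near zero would suggest a stable relationship, while a large positive or negative slope would suggest a fast-changing one.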
Referring to fig. 8, the big data-based data mining system is configured to perform the big data-based data mining method, where the system includes a time sequence analysis module, a feature mining module, a rule dynamic update module, a relational network mapping module, a community connection identification module, a correlation strength prediction module, and a trend prediction and analysis module;
the time sequence analysis module extracts time stamps from the original data, sorts the time stamps according to time sequence, sets time intervals, divides the data into continuous time periods, calculates basic statistics of the data in multiple time periods, including total number, average value and extremum, and generates a time sequence data snapshot;
The feature mining module analyzes the time sequence data snapshot, identifies data fluctuation and key indexes in each time period, extracts feature changes and key fluctuation modes by combining the changes of the data in multiple time periods, and acquires time period feature analysis;
The rule dynamic updating module applies a sliding window dividing technology to analyze the characteristics of the time period, gradually adjusts the size and the step length of a window, analyzes the change and the mode of data in the continuous time period, extracts the change relation and the trend, and generates a dynamic association model;
The relation network mapping module maps a plurality of articles and variation relations thereof into a form of a graph by using a dynamic association model, the articles are used as nodes, the variation relations are converted into edges, and the weights of the edges are adjusted to reflect the association strength, so that an article relation network is constructed;
The community connection identification module analyzes the article relation network, identifies connected node groups by using the structural characteristics of the network, groups them according to connection density, and defines the groups as distinct communities to obtain a community structure diagram;
The association strength prediction module evaluates, based on the community structure diagram, the potential connection probabilities of nodes within and across communities, predicts new or reinforced associations by referring to common attributes and historical interactions among the nodes, and generates an association prediction matrix;
the trend prediction and analysis module analyzes the data in the association prediction matrix, tracks the evolution of the article relationships between different time points, identifies the trend of relationship change, including stability and development speed, and establishes a trend analysis framework.
The present invention is not limited to the above embodiments, and any equivalent embodiments which can be changed or modified by the technical disclosure described above can be applied to other fields, but any simple modification, equivalent changes and modification made to the above embodiments according to the technical matter of the present invention will still fall within the scope of the technical disclosure.

Claims (8)

1. The data mining method based on big data is characterized by comprising the following steps of:
Analyzing the time stamp of the original data set, segmenting the data, wherein each segment represents a data snapshot in a target time interval, identifying key features and fluctuation modes in multiple time periods, and generating a time segmentation feature set;
applying a sliding window technology to the time segment feature set, adjusting the size and the step length of a window, analyzing the data change in each window, identifying the change relation among articles, mining the evolution of the change relation along with time, and constructing a dynamic association rule set;
converting the dynamic association rule set into a set of nodes and edges, wherein each article is used as a node, the association rule is used as an edge, the weight of the edge between the nodes is used for representing association strength, and an article relation diagram reflecting interaction between the articles is constructed;
applying community discovery logic on the article relation graph, grouping similar or related nodes according to the connection density and pattern of the nodes, revealing the inherent relevance among different article groups, and obtaining an article community structure;
evaluating potential future connections by using the node and edge information in the article community structure, estimating the probability that articles inside and outside a community form new associations or reinforce existing associations, and obtaining a predicted association strength result;
combining the data in the predicted association strength result with the dynamic association rule set, analyzing the development trend of the article relationships over time, evaluating the stability and change speed of the relationships, and establishing a relationship dynamic trend analysis;
wherein the step of evaluating potential future connections by using the node and edge information in the article community structure, estimating the probability that articles inside and outside a community form new associations or reinforce existing associations, and obtaining a predicted association strength result is specifically as follows:
Calculating the number of common neighbors between each pair of nodes by using the node and edge information in the article community structure and adopting a network analysis method, evaluating the fitting degree between the nodes based on the number of common neighbors, and regarding the two nodes as having a potential connection if the fitting degree exceeds a preset threshold, so as to generate a node fitting degree matrix;
Screening node pairs with fitting degree higher than a target threshold based on the node fitting degree matrix, carrying out community attribution analysis on the node pairs, if two nodes belong to the same community, improving the potential connection probability of the two nodes, if the two nodes do not belong to the same community, reducing the potential connection probability of the two nodes, and constructing a potential connection probability table;
scoring and sorting the potential connection strength of the node pairs according to the potential connection probability table, determining the potential connection strength of each pair of nodes, obtaining a sorted node pair list, and generating a predicted association strength result;
The network analysis method is as follows:
[formula image not reproduced in this text]
Calculating the number of weighted common neighbors between each pair of nodes to generate the node fitness matrix, wherein N(i) and N(j) denote the neighbor sets of node i and node j, Si and Sj are the connection strengths of the nodes, ifc is the information flow coefficient, Savg is the average connection strength of the network, and Cavg is the average aggregation coefficient. Four further symbols defined at this point in the original were rendered as images and are not reproduced in this text.
2. The big data based data mining method according to claim 1, wherein the time segment feature set comprises time segment division, activity frequency and key fluctuation mode, the dynamic association rule set is specifically item combination change, purchase frequency trend and target combination occurrence number, the item relation graph comprises node definition, side weight and node concentration, the item community structure comprises item grouping set, internal connection density and community independence, the prediction association strength result comprises new item combination potential, relationship strengthening possibility and community prediction trend, and the relationship dynamic trend analysis is specifically stability analysis, change speed and future trend prediction.
3. The big data based data mining method of claim 1, wherein the steps of analyzing a time stamp of an original data set, segmenting the data, each segment representing a snapshot of the data within a target time interval, identifying key features and fluctuation patterns within multiple time periods, and generating a time segment feature set are specifically as follows:
Screening time stamp information in an original data set, sorting according to a time sequence, setting time interval parameters, dividing the data set into a plurality of continuous time periods, wherein each time period comprises data records in the time period, and acquiring a time period data set;
Traversing the time period data set, calculating key indexes in each time period, including average value, maximum value and minimum value, calculating fluctuation indexes of data, including standard deviation, and identifying data distribution characteristics and fluctuation modes in each time period to generate time period characteristic indexes;
Based on the time period characteristic indexes, characteristic changes among multiple time periods are analyzed, a continuous change trend or a periodic fluctuation mode is identified, and a time segment characteristic set reflecting data changes in the multiple time periods is constructed by combining key events or target date information in the time periods.
4. The big data based data mining method according to claim 1, wherein the step of applying a sliding window technique to the time-segmented feature set, adjusting window size and step size, analyzing data changes in each window, identifying a change relation between articles, mining evolution of the change relation with time, and constructing a dynamic association rule set is specifically as follows:
Selecting the time segment feature set as initial data, determining the initial sliding window size and step length, and applying sliding window division to the data set so that each window stores continuous time sequence data and a window sequence set is generated;
Analyzing the data of each window in the window sequence set, calculating the statistical index of each window, including the average value and variance of the data, identifying and marking the key data change and the characteristics in the window, and generating a window characteristic overview;
Using the window feature overview to compare feature differences between different windows, identifying the modes and relations of data change, and constructing a change relation map;
and analyzing how the associations among articles change over time according to the change relation map, and constructing a dynamic association rule set reflecting time evolution by integrating and optimizing the dynamic association information.
5. The big data based data mining method according to claim 1, wherein the step of converting the dynamic association rule set into a set of nodes and edges, each article being a node, the association rule being an edge, using weights of the edges between the nodes to represent association strength, and constructing an article relationship graph reflecting interactions between articles is specifically as follows:
Based on the dynamic association rule set, identifying association rules among multiple articles, defining each article as an independent node, converting each association rule into an edge connected with the corresponding article, and jointly deciding the preliminary weight of the edge by the support degree and the confidence degree of the association rule to generate an article node set and a preliminary association edge set;
based on the article node set and the preliminary association edge set, recalculating the weight of each edge, and adjusting the preliminary weight by a proportionality coefficient so as to reflect the association strength and directionality among articles, obtaining an optimized association edge set;
and constructing an interaction diagram among the whole articles by using the optimized association edge set, so that each node represents one article, each edge represents the association between two articles, the weight of the edge represents the association strength, and an article relation diagram is built.
6. The big data based data mining method according to claim 1, wherein the step of applying community discovery logic on the article relation graph, grouping similar or related nodes according to the connection density and pattern of the nodes, revealing the inherent relevance among different article groups, and obtaining the article community structure is specifically as follows:
Collecting nodes and connection information in the article relation graph, drawing each article to be used as a node by using a graphic representation method, and constructing an article network graph by using the interrelationship among the articles as edges;
Calculating the connection density among nodes based on the article network graph, marking regions with high connection density as potential communities, and gradually adding edge nodes until no new node can be added, forming a preliminary community division;
Optimizing the preliminary community division, adjusting the connection density threshold and the community size standard, and improving the internal consistency and external distinctness of the communities through merging or splitting operations to obtain the article community structure.
7. The big data-based data mining method according to claim 1, wherein the step of combining the data in the predicted association strength result with a dynamic association rule set, analyzing a development trend of the object relationship with time, evaluating stability and change speed of the relationship, and establishing a relationship dynamic trend analysis specifically comprises:
Extracting the association strength value of each pair of articles at different time points from the predicted association strength result, and grouping the association strength values by time tag, wherein each group of data represents the article relationships in a target time period, thereby constructing a time sequence association matrix;
For the time sequence association matrix in each time period, calculating the change trend of the article relationship strength, including the average, highest and lowest association strength of each time period and their change rate, to obtain an association strength change index for each time period;
And analyzing the development trend of the article relationships over a plurality of time periods by using the association strength change indexes, and predicting the future stability and change speed of the article relationships in combination with the historical change pattern to establish the relationship dynamic trend analysis.
8. The big data-based data mining system is characterized by being applied to the big data-based data mining method according to any one of claims 1-7, and comprises a time sequence analysis module, a feature mining module, a rule dynamic updating module, a relational network mapping module, a community connection identification module, a correlation strength prediction module and a trend prediction and analysis module;
The time sequence analysis module extracts time stamps from the original data, sorts the time stamps according to time sequence, sets time intervals, divides the data into continuous time periods, calculates basic statistics of the data in multiple time periods, including total number, average value and extremum, and generates a time sequence data snapshot;
The feature mining module analyzes the time sequence data snapshot, identifies data fluctuation and key indexes in each time period, extracts feature changes and key fluctuation modes by combining the changes of the data in multiple time periods, and acquires time period feature analysis;
The rule dynamic updating module applies a sliding window dividing technology to analyze the characteristics of the time period, gradually adjusts the window size and the step length, analyzes the change and the mode of data in the continuous time period, extracts the change relation and the trend, and generates a dynamic association model;
The relation network mapping module maps a plurality of articles and variation relations thereof into a form of a graph by using a dynamic association model, the articles are used as nodes, the variation relations are converted into edges, and the weights of the edges are adjusted to reflect the association strength, so that an article relation network is constructed;
The community connection identification module analyzes the article relation network, identifies connected node groups by using the structural characteristics of the network, groups them according to connection density, and defines the groups as distinct communities to obtain a community structure diagram;
The association strength prediction module evaluates, based on the community structure diagram, the potential connection probabilities of nodes within and across communities, predicts new or reinforced associations by referring to common attributes and historical interactions among the nodes, and generates an association prediction matrix;
The trend prediction and analysis module analyzes the data in the association prediction matrix, tracks the evolution of the article relationships between different time points, identifies the trend of relationship change, including stability and development speed, and establishes a trend analysis framework.
CN202410284239.XA 2024-03-13 2024-03-13 Data mining method and system based on big data Active CN117891857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410284239.XA CN117891857B (en) 2024-03-13 2024-03-13 Data mining method and system based on big data


Publications (2)

Publication Number Publication Date
CN117891857A CN117891857A (en) 2024-04-16
CN117891857B true CN117891857B (en) 2024-05-24

Family

ID=90639861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410284239.XA Active CN117891857B (en) 2024-03-13 2024-03-13 Data mining method and system based on big data

Country Status (1)

Country Link
CN (1) CN117891857B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017069548A1 (en) * 2015-10-23 2017-04-27 아주대학교산학협력단 Apparatus for visualizing analysis of set relationship in complex network and method therefor
CN107239498A (en) * 2017-05-03 2017-10-10 同济大学 A kind of method for excavating overlapping community's dynamic evolution correlation rule
CN110288003A (en) * 2019-05-29 2019-09-27 北京师范大学 Data variation recognition methods and equipment
CN111639251A (en) * 2020-06-16 2020-09-08 李忠耘 Information retrieval method and device
CN111831706A (en) * 2020-06-30 2020-10-27 新华三大数据技术有限公司 Mining method and device for association rules among applications and storage medium
CN112053210A (en) * 2020-09-11 2020-12-08 深圳市梦网视讯有限公司 Commodity community classification-based associated value propagation method, system and equipment
CN113094284A (en) * 2021-04-30 2021-07-09 中国工商银行股份有限公司 Application fault detection method and device
CN115511233A (en) * 2021-06-22 2022-12-23 国网上海市电力公司 Supply chain process reproduction method and system based on process mining
CN117591944A (en) * 2024-01-19 2024-02-23 广东工业大学 Learning early warning method and system for big data analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Discovering Association Rules with Graph Patterns in Temporal Networks; Chu Huang et al.; Tsinghua Science and Technology; 2023-04-30; pp. 344–359 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant