CN118211038A - Multidimensional data processing analysis method, device, system and storage medium - Google Patents
- Publication number
- CN118211038A (application number CN202410632459.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- item
- cluster
- idf
- frequent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/26—Discovering frequent patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a multidimensional data processing and analyzing method, device, system and storage medium. The method comprises the following steps: acquiring multi-source heterogeneous service data through a pre-configured data interface; extracting features from unstructured service data to form a text data set; traversing the text data set to construct a TF-IDF vector matrix; constructing a ball tree index space; performing density-based clustering on all data points to obtain a plurality of clusters; constructing a candidate 1-item set by taking one cluster as one item; screening out items not smaller than a first support threshold to obtain a frequent 1-item set, wherein each item in the frequent 1-item set consists of one cluster; constructing frequent n-item sets in the same manner until no frequent (n+1)-item set can be constructed; and determining the items meeting a preset confidence based on all the generated frequent item sets, and outputting the association rules.
Description
Technical Field
The present application relates to the field of big data processing technologies, and in particular, to a method, an apparatus, a system, and a storage medium for multidimensional data processing analysis.
Background
Big data is a broad field, concerned primarily with the size of data sets and with the ability to mine information from such data. In essence, big data is not just a question of how much data there is; it is a multi-dimensional challenge that requires engineers to rethink, at multiple levels, how to capture, store, share, analyze, and visualize data. The characteristics of big data are commonly expressed along the following dimensions:
Volume
Volume is the most direct feature of big data, referring to the scale of the data itself. With the popularity of the internet, mobile devices, and the Internet of Things (IoT), humans create and store more data every day than ever before. From social media posts to satellite images, from sensor networks to transaction records, the amount of data grows exponentially. The size of these data far exceeds the processing power of conventional database software, and new technologies are needed to store and process them efficiently.
Variety
Variety relates to the many forms and types of data. Data may be structured, such as tables in a database, or unstructured, such as text, pictures, video, and audio. Within the scope of big data, engineers must process data from different sources in different formats and fuse them together to extract valuable information. For example, an enterprise may need to combine customers' online interaction data with internal sales records to obtain a more comprehensive view.
Velocity
Velocity refers to the speed at which data is generated and the speed at which it must be processed. In many cases, data flows in extremely fast and requires real-time or near-real-time processing so that decisions or responses can be made quickly. For example, stock exchange data in financial markets, real-time posts on social media, and sensor data in emergency response systems all require rapid analysis and response.
Veracity
Veracity concerns the quality and accuracy of data. Across large and diverse volumes of data, ensuring authenticity and consistency is a significant challenge. Data may be incomplete, outdated, or contain misleading information. To make reliable decisions based on data, its quality must first be evaluated, and appropriate data cleansing and preprocessing steps then applied to improve its veracity.
To address these challenges, a variety of techniques and tools have emerged, such as distributed computing frameworks (e.g., Apache Hadoop), streaming computing techniques for real-time processing (e.g., Apache Spark), and NoSQL databases for processing unstructured data. In addition, artificial intelligence and machine learning techniques also play an important role in big data analysis; they can help identify patterns, predict trends, and provide insight.
In a big data environment, data is not only bulky and complex, but it often exists in a multi-dimensional form, which provides an overall view for engineers to understand and parse the data. Analysis of multidimensional data, particularly with the support of the prior art, enables engineers to drill deep into the rich information and insight that is contained in the data.
The processing and analysis of multidimensional data generally requires the aid of advanced techniques and algorithms due to its inherent complexity. Dimension reduction, cluster analysis, and association rule mining are several of the techniques commonly used. These methods can help engineers extract valuable information from multidimensional data, reveal hidden patterns and relationships, and further support more efficient decision-making.
Dimension reduction
Dimension reduction techniques aim to reduce the number of features in the data while maintaining as much important information as possible of the original data. This not only reduces the storage and computation requirements of the data, but also improves the performance of the data analysis model. Principal Component Analysis (PCA) is a popular linear dimension reduction technique that reduces the dimension of data by finding the directions in the data where the variance is greatest, and projecting the data into these directions. Linear Discriminant Analysis (LDA) is another approach that aims to identify the most efficient features that distinguish different classes, typically for supervised learning scenarios.
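As an illustration (not part of the claimed method), PCA-based dimension reduction can be sketched with the scikit-learn library as follows; the input matrix and the component count are hypothetical:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 300)                 # hypothetical high-dimensional feature matrix
pca = PCA(n_components=10)                   # keep the 10 directions of largest variance
X_reduced = pca.fit_transform(X)             # project the data onto those directions
print(pca.explained_variance_ratio_.sum())   # fraction of the original variance retained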
Cluster analysis
Cluster analysis is an unsupervised learning technique for grouping data points into "clusters" such that the similarity between data points in the same cluster is high, while the similarity between data points in different clusters is low. The K-Means algorithm is one of the most widely known clustering methods, forming clusters by optimizing the sum of squares of intra-cluster distances. Hierarchical clustering builds a hierarchical structure of clusters by gradually merging or splitting existing clusters. Spectral clustering uses spectral (i.e., eigenvalue) properties of the data to perform clustering, which is more efficient for certain types of data sets (e.g., clusters that are complex in shape or large in size).
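For illustration, a minimal K-Means sketch using scikit-learn, with hypothetical data; it reports the within-cluster sum of squared distances that the algorithm optimizes:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 5)                   # hypothetical data points
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])                       # cluster assignment of the first ten points
print(km.inertia_)                           # within-cluster sum of squared distances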
Association rule mining
Association rule mining aims to find interesting relationships between items in a large dataset, e.g. "if item A is purchased, item B is also likely to be purchased".
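For illustration, a minimal sketch of classical association rule mining using the third-party mlxtend library (the transactions and thresholds are hypothetical, and this is not the optimized procedure claimed below):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['bread', 'milk'], ['bread', 'beer'], ['bread', 'milk', 'beer']]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(df, min_support=0.5, use_colnames=True)        # frequent item sets
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])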
Despite significant advances in multidimensional data processing and analysis, the prior art still exhibits some limitations in the face of today's data environment.
Efficiency problems in handling large-scale data
Conventional data processing algorithms encounter efficiency bottlenecks as the volume of data expands. Data read-write operations, computation, memory management, and the like consume a great deal of time and resources. In particular, for algorithms requiring iterative computation (e.g., certain machine learning algorithms), the runtime on a large-scale data set may increase dramatically. This not only limits the real-time nature of the analysis, but also greatly increases resource costs.
For example, performing association rule learning on a large-scale data set may produce an enormous number of rules, making it difficult to find the truly valuable ones.
For another example, with some clustering algorithms, a huge data set means that every traversal of the database occupies a large amount of computing resources; algorithms that must frequently traverse and query the database therefore place extremely high demands on the computing power of the device.
The "curse of dimensionality" in high-dimensional data
The "curse of dimensionality" means that the number of samples required for data analysis and mining increases exponentially as the dimensionality of the data increases. In high-dimensional space, the distance differences between data points become less pronounced, which poses a challenge to distance-based algorithms (e.g., K-Means clustering). In addition, high-dimensional data also makes visualization and interpretation more difficult.
For example, some existing clustering algorithms typically divide data into different groups, but they do not provide a direct explanation of why certain data points are clustered together.
As another example, association rule learning performed on a large data set may produce a large number of rules, making it difficult to find the truly valuable ones.
Diversity of data sources
Modern datasets typically contain multiple types of data, such as text, images, sound, time series, and the like. Each type of data has its own unique structure and characteristics, which require specific preprocessing and feature extraction prior to analysis.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a multidimensional data processing and analyzing method, apparatus, system and storage medium, and the following describes the technical scheme in the present application:
The first aspect of the application provides a multidimensional data processing and analyzing method, which comprises the following steps:
Acquiring multi-source heterogeneous service data through a pre-configured data interface, wherein the multi-source heterogeneous service data comprises structured service data and unstructured service data;
Extracting features from the unstructured service data, storing the extracted features, and storing the extracted features and the structured service data together into a preset data structure to form a text data set;
traversing the text data set, constructing TF-IDF vectors of all documents in the text data set, and forming a TF-IDF vector matrix;
Recursively grouping each vector in a TF-IDF vector matrix into a nested ball tree by using a ball tree construction algorithm to obtain a ball tree index space in which the TF-IDF vector is nested;
based on the ball tree index space, clustering all data points based on density to obtain a plurality of clusters;
assigning a unique identification code to each cluster, and replacing all data points contained in the cluster with the identification codes;
According to the unique identification code, constructing a candidate 1-item set by taking a cluster as an item;
calculating the support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and comparing the support degree with a preset first support degree threshold value, wherein the transaction library comprises a plurality of transactions, and the transactions consist of one or more items;
Screening out items not smaller than the first support threshold value to obtain a frequent 1-item set, wherein each item in the frequent 1-item set consists of a cluster;
constructing frequent n-item sets in the same manner until no frequent (n+1)-item set can be constructed;
and determining, based on all the generated frequent item sets, the items meeting a preset confidence, and outputting the association rules.
Optionally, traversing the text data set to construct TF-IDF vectors for all documents in the text data set, and forming a TF-IDF vector matrix includes:
for any vocabulary in the text data set, calculating the occurrence frequency of the vocabulary in the text data set to obtain word frequency TF;
Calculating the document proportion of the documents containing the vocabulary in the text data set to the total document number, and calculating the logarithm of the document proportion to obtain an inverse document frequency IDF;
traversing the text data set, and constructing TF-IDF vectors according to the obtained TF and IDF;
and constructing a TF-IDF vector matrix for all documents in the text data set by using the obtained TF-IDF vector.
Optionally, for any vocabulary in the text data set, calculating the frequency of occurrence of the vocabulary in the text data set, and obtaining the word frequency TF includes:
Word segmentation processing is carried out on any text data in the text data set;
Calculating word frequencies of various words in the text data by the following formula:
TF(t, D) = C(t, D) / T(D);
Wherein:
TF represents the vocabulary frequency;
t represents a given vocabulary;
D represents a given document;
C(t, D) represents the number of times the vocabulary t appears in document D;
T(D) represents the total number of words in document D.
Optionally, the calculating a document proportion of the documents in the text dataset, which contain the vocabulary, to a total document number, and calculating a logarithm of the document proportion, to obtain an inverse document frequency IDF includes:
the inverse document frequency in the text data is calculated by:
IDF(t)=log_e(N_doc/df(t));
Wherein:
IDF represents the inverse document frequency;
t represents a given vocabulary;
n_doc represents the total number of documents in the text dataset;
df (t) represents the number of documents containing the vocabulary t;
log_e is a logarithmic function based on a natural constant e.
Optionally, the constructing a TF-IDF vector according to the obtained TF and IDF includes:
The TF-IDF vector is calculated by the following equation:
TF-IDF(t, D) = TF(t, D) * IDF(t).
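For illustration only, the three formulas above can be sketched directly in Python (the tokenized sample documents are hypothetical, and each vocabulary item is assumed to occur in at least one document):

import math

def tf(t, doc):
    # TF(t, D) = C(t, D) / T(D); doc is a list of word tokens
    return doc.count(t) / len(doc)

def idf(t, docs):
    # IDF(t) = log_e(N_doc / df(t))
    df = sum(1 for d in docs if t in d)
    return math.log(len(docs) / df)

def tf_idf(t, doc, docs):
    # TF-IDF(t, D) = TF(t, D) * IDF(t)
    return tf(t, doc) * idf(t, docs)

docs = [['bread', 'milk'], ['bread', 'beer'], ['milk', 'beer', 'bread']]
print(tf_idf('milk', docs[0], docs))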
Optionally, the clustering based on the density is performed on all the data points based on the ball tree index space, and obtaining a plurality of clusters includes:
Determining a preset neighborhood radius Eps and a minimum point number MinPts;
For any data point in the ball tree index space, determining the number of data points contained in its Eps neighborhood by utilizing the ball tree index space;
If the number of data points contained in the Eps neighborhood of a data point is not less than MinPts, determining the data point as a core object;
Creating an empty output list for storing the output data points;
Creating an empty priority queue, wherein the priority queue is ordered by reachable distance, the reachable distance represents the distance from a data point to the nearest core object, and if a data point has not been processed, its reachable distance is set to infinity;
Randomly selecting an unprocessed target point p from the ball tree index space, marking it as processed, and adding it to the output list;
The entire ball tree index space is traversed by the following steps until all data points are marked as processed:
Step a: if the target point p is a core object, determining, by utilizing the ball tree index space, the distance from the target point p to each unprocessed data point in its Eps neighborhood, adding those data points to the priority queue, and updating their reachable distances to the nearest core object;
If the data point is already in the priority queue and the distance from the data point to the target point p is smaller than the current reachable distance, updating the distance of the data point in the priority queue;
step b: taking out the data point with the smallest reachable distance from the priority queue, marking the data point as processed, and adding the data point into the output list;
step c: if the extracted data point is a core object, repeating the step a;
Step d: if the priority queue is not empty, repeating the steps b and c;
after the ordered output list is output, cluster extraction is carried out on the ordered list, and a cluster set formed by a plurality of clusters is obtained.
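The ordering procedure described above is close in spirit to the OPTICS algorithm. For illustration only, a sketch using scikit-learn's OPTICS with a ball tree index and hypothetical stand-in vectors (not the exact claimed implementation):

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(100, 5)                        # stand-in for the TF-IDF vectors
opt = OPTICS(min_samples=5, max_eps=0.5, algorithm='ball_tree').fit(X)
ordered_reach = opt.reachability_[opt.ordering_]  # reachable distances in output order
print(opt.labels_[:10])                           # clusters extracted from the ordering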
Optionally, traversing the text data set to construct TF-IDF vectors for all documents in the text data set, and forming a TF-IDF vector matrix includes:
The TF-IDF values of all words in the document are formed into TF-IDF vectors;
The TF-IDF vectors of all documents are arranged according to rows to form a TF-IDF vector matrix;
The number of rows of the matrix is the number of documents, and the number of columns is the number of words.
Optionally, the recursively grouping each vector in the TF-IDF vector matrix into a nested ball tree using a ball tree construction algorithm, obtaining a ball tree index space nested with the TF-IDF vector includes:
selecting a vector from the TF-IDF vector matrix as a root node of a ball tree, and taking the vector as an original ball tree, wherein the original ball tree comprises all vectors in the TF-IDF vector matrix;
Dividing the vectors in the original ball tree into a first group and a second group according to a first predetermined dividing value, wherein the distance from the vectors in the first group to the root node is smaller than or equal to the first dividing value, and the distance from the vectors in the second group to the root node is larger than the first dividing value;
recursively constructing leaf nodes for each set of vectors;
Repeating the steps for each leaf node until a termination condition is met, wherein the termination condition is that the number of vectors contained in each leaf node is smaller than a preset termination threshold value; when the termination condition is satisfied, a ball tree index space is formed.
Optionally, the first segmentation value is determined by:
determining projection characteristics, and calculating projection values of all TF-IDF vectors in the directions corresponding to the projection characteristics based on the projection characteristics to form a projection value sequence;
a median of the sequence of projection values is determined and the median is determined as the first segmentation value.
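For illustration, a sketch of this segmentation-value computation with NumPy; the vectors and projection direction are hypothetical:

import numpy as np

def first_split_value(vectors, direction):
    # project every TF-IDF vector onto the chosen projection feature
    projections = vectors @ direction
    # the median of the projection values serves as the first segmentation value
    return np.median(projections)

vecs = np.random.rand(8, 4)               # hypothetical TF-IDF vectors
axis = np.array([1.0, 0.0, 0.0, 0.0])     # hypothetical projection feature
print(first_split_value(vecs, axis))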
Optionally, the determining the projection characteristic includes:
carrying out standardization processing on all TF-IDF vectors;
Calculating a covariance matrix of the normalized TF-IDF vector matrix;
performing eigenvalue decomposition on the covariance matrix to obtain eigenvalues and corresponding eigenvectors, wherein the eigenvalues represent variances of the data in the direction of the eigenvectors, and the eigenvectors represent importance of the data in the direction of the eigenvectors;
and determining the number of projection features to be reserved according to a preset dimension reduction target, and determining corresponding feature vectors as projection features.
Optionally, the normalizing all TF-IDF vectors includes:
Calculating the mean value and standard deviation of each TF-IDF vector;
For each TF-IDF vector in the TF-IDF vector matrix, performing a normalization process by:
y=(a-m)/s;
where a is the original TF-IDF value, m is the mean value of the corresponding TF-IDF features, and s is the standard deviation of the corresponding TF-IDF features.
Optionally, the calculating the covariance matrix of the normalized TF-IDF vector matrix includes:
performing transposition operation on the standardized TF-IDF matrix so that each row represents a TF-IDF characteristic and each column represents a document;
Performing product operation on the transposed TF-IDF matrix and the original TF-IDF matrix to obtain an intermediate matrix;
Dividing each element in the intermediate matrix by the total document number contained in the text data set to obtain a covariance matrix.
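For illustration, the normalization, covariance and eigen-decomposition steps above can be sketched with NumPy (the matrix size and the number of retained features are hypothetical):

import numpy as np

X = np.random.rand(20, 6)                     # hypothetical TF-IDF matrix (documents x terms)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # y = (a - m) / s per feature
cov = Xs.T @ Xs / Xs.shape[0]                 # transpose, multiply, divide by the document count
eigvals, eigvecs = np.linalg.eigh(cov)        # eigen-decomposition of the symmetric covariance
order = np.argsort(eigvals)[::-1]             # directions of largest variance first
projection_features = eigvecs[:, order[:2]]   # e.g. retain two projection features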
Optionally, after outputting the ordered output list, performing cluster extraction on the ordered list to obtain a cluster set formed by a plurality of clusters, where the obtaining includes:
Drawing a scatter diagram based on the ordered output list, wherein the scatter diagram comprises an x-axis and a y-axis, the x-axis represents the index of a data point in the ball tree index space, and the y-axis represents the reachable distance of the data point;
Identifying and marking valleys on the scatter plot;
determining a cluster segmentation threshold according to the identified valley, wherein the cluster segmentation threshold is larger than a y value corresponding to the valley;
Traversing the output list according to the cluster segmentation threshold value, and assigning consecutive data points to a cluster, wherein the reachable distance of the data points in the cluster is smaller than or equal to the corresponding cluster segmentation threshold value;
and when the reachable distance of the data point is higher than the current cluster segmentation threshold value, starting a new cluster until the output list is traversed, and obtaining a cluster set.
Optionally, the identifying and marking the valley on the scatter plot includes:
first or second derivatives of the reachable distances on the scatter plot are calculated to identify valleys.
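For illustration, valley identification via the first derivative can be sketched as follows (the reachable-distance values are hypothetical):

import numpy as np

def find_valleys(reachability):
    # approximate the first derivative of the reachable-distance curve
    grad = np.gradient(reachability)
    # a valley is where the derivative turns from negative to non-negative
    return [i for i in range(1, len(grad)) if grad[i - 1] < 0 and grad[i] >= 0]

reach = np.array([0.9, 0.4, 0.3, 0.35, 0.8, 0.25, 0.2, 0.6])
print(find_valleys(reach))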
Optionally, after traversing the output list and obtaining a plurality of clusters, the method further includes:
and calculating the number of data points contained in each cluster, and eliminating the clusters smaller than a preset number threshold.
Optionally, the calculating the support of each item in the candidate 1-item set in the pre-constructed transaction library includes:
the support of each item in the candidate 1-item set in the transaction library is calculated as follows:
support(item_i)=count(item_i)/N;
Wherein the item contained in the candidate 1-item set is item_i, count(item_i) represents the number of occurrences of the item in all transactions, and N is the total number of transactions; support(item_i) represents the support of item item_i.
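For illustration, the support computation and the screening of the frequent 1-item set can be sketched as follows (the cluster identifiers, transactions and threshold are hypothetical):

transactions = [{'C1', 'C2'}, {'C1'}, {'C2', 'C3'}, {'C1', 'C3'}]  # hypothetical transaction library

def support(item):
    # support(item_i) = count(item_i) / N
    return sum(1 for tx in transactions if item in tx) / len(transactions)

min_support = 0.5
frequent_1 = [i for i in ('C1', 'C2', 'C3') if support(i) >= min_support]
print(frequent_1)  # items retained in the frequent 1-item set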
Optionally, after constructing the frequent n-term set, the method further includes:
Based on the support degree of each item in the frequent 1-item set, ordering the support degree in a descending order, and generating an item header table, wherein the item header table stores names and support degrees of the items;
Constructing a tree data structure of each item based on the item header table;
The new frequent item set is mined through the tree data structure;
And merging the new frequent item set obtained by mining with the original frequent item set to obtain a final frequent item set.
Optionally, the building the tree data structure of each item based on the item header table includes:
Creating a root node of the tree data structure;
Starting from the root node, according to the ordering of each item in the item header table, sequentially growing the items in each transaction in the transaction library on the root node to obtain a tree data structure.
Optionally, mining the new frequent item set through the tree data structure includes:
the item header table also comprises a node position pointer of the item in the tree data structure;
traversing the tree data structure for a target item in the item header table based on a node position pointer corresponding to the target item to obtain an item path containing the target item;
Traversing the whole item header table, and storing all obtained item paths as a new frequent item set.
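For illustration only, a minimal sketch of this FP-Growth-style structure: a header table in descending support order, transactions grown onto a tree, node position pointers per item, and path extraction for a target item (the cluster identifiers are hypothetical, and this is a sketch rather than the exact claimed implementation):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_tree(transactions, header_order):
    # header_order: items sorted by descending support (the item header table)
    root = FPNode(None, None)
    links = {item: [] for item in header_order}  # node position pointers per item
    for tx in transactions:
        node = root
        for item in (i for i in header_order if i in tx):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                links[item].append(node.children[item])
            node = node.children[item]
    return root, links

def item_paths(item, links):
    # collect every tree path that contains the target item
    paths = []
    for node in links[item]:
        path, cur = [], node
        while cur.parent is not None:
            path.append(cur.item)
            cur = cur.parent
        paths.append((path[::-1], node.count))
    return paths

root, links = build_tree([{'C1', 'C2'}, {'C1'}, {'C1', 'C2', 'C3'}], ['C1', 'C2', 'C3'])
print(item_paths('C2', links))  # item paths containing C2, with their counts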
Optionally, the determining, based on all the generated frequent item sets, an item meeting a preset confidence level, and outputting an association rule includes:
Generating an association rule set based on the frequent item set, wherein the association rule set comprises a plurality of association rules;
The rule confidence of each association rule is calculated as follows:
For association rule X→Y, then there is:
Confidence(X→Y) = support(X, Y) / support(X);
In the association rule X→Y, X is the condition and Y is the result; the rule X→Y indicates that when condition X appears, result Y also appears at the same time. support(X, Y) represents the support of the frequent item set (X, Y), and support(X) represents the support of the frequent item set (X);
Based on the confidence of each association rule, comparing it with a preset confidence threshold, screening out and outputting the association rules meeting the confidence requirement, wherein the confidence requirement includes that the confidence is not smaller than the confidence threshold.
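For illustration, the confidence calculation and threshold screening can be sketched as follows (the transactions and threshold are hypothetical):

transactions = [{'C1', 'C2'}, {'C1', 'C2'}, {'C1'}, {'C2', 'C3'}]  # hypothetical transaction library

def sup(items):
    return sum(1 for tx in transactions if items <= tx) / len(transactions)

def confidence(X, Y):
    # Confidence(X -> Y) = support(X, Y) / support(X)
    return sup(X | Y) / sup(X)

conf_threshold = 0.6
print(confidence({'C1'}, {'C2'}) >= conf_threshold)  # whether rule C1 -> C2 is output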
Optionally, after obtaining the cluster set formed by the plurality of clusters, the method further includes:
evaluating the clustering effect of each cluster in the cluster set to obtain an evaluation result;
and adjusting the value of the neighborhood radius Eps or the minimum point number MinPts according to the evaluation result.
Optionally, the evaluating the clustering effect of each cluster in the cluster set, and obtaining the evaluation result includes:
for each cluster, the intra-class sum of squares SS_W is calculated by:
SS_W = Σ_{i=1}^{k} Σ_{x∈C_i} ||x - μ_i||^2;
Wherein SS_W represents the sum of squares within classes, i represents the i-th cluster, i = 1, 2, ..., k; x∈C_i represents each element in cluster C_i, and ||x - μ_i||^2 represents the square of the distance from element x to the cluster centroid μ_i;
for each cluster, the inter-class sum of squares SS_B is calculated by the following equation:
SS_B = Σ_{i=1}^{k} n_i ||\bar{x}_i - \bar{x}||^2;
Wherein SS_B represents the sum of squares between classes, i represents the i-th cluster, i = 1, 2, ..., k; n_i represents the number of elements in cluster C_i, and ||\bar{x}_i - \bar{x}||^2 represents the square of the distance from the cluster centroid \bar{x}_i to the centroid \bar{x} of the cluster set;
The evaluation index was calculated by the following formula:
CH-Index={SS_B/(k-1)}/{SS_W/(n-k)};
Where k represents the number of clusters and n represents the total number of TF-IDF vectors contained in the ball tree index space.
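For illustration, scikit-learn provides this evaluation index directly as the Calinski-Harabasz score, which equals {SS_B/(k-1)} / {SS_W/(n-k)}; the vectors and labels below are hypothetical stand-ins:

import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(60, 4)                # stand-in TF-IDF vectors
labels = np.random.randint(0, 3, 60)     # stand-in cluster assignments
print(calinski_harabasz_score(X, labels))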
Optionally, the adjusting the value of the radius Eps or the minimum point number MinPts according to the evaluation result includes:
If the CH-Index is larger than a preset expected value, decreasing the neighborhood radius Eps or increasing the minimum point number MinPts;
and if the CH-Index is smaller than the preset expected value, increasing the neighborhood radius Eps or decreasing the minimum point number MinPts.
Optionally, the assigning a unique identification code to each cluster and replacing all the data points contained in the cluster with the identification codes includes:
Creating an integer variable for generating a unique cluster identification code;
For each cluster, assigning a new unique identification code using the integer variable;
traversing each data point in all clusters;
for each data point, determining the cluster to which the data point belongs, and replacing the value of the data point with the unique identification code of the cluster.
A second aspect of the present application provides a multidimensional data processing analysis device, the device comprising:
the system comprises a data acquisition unit, a data processing unit and a data processing unit, wherein the data acquisition unit is used for acquiring multi-source heterogeneous service data through a pre-configured data interface, and the multi-source heterogeneous service data comprises structured service data and unstructured service data;
The feature extraction unit is used for extracting features from the unstructured service data, storing the extracted features, and storing the extracted features and the structured service data into a preset data structure together to form a text data set;
The vector construction unit is used for traversing the text data set, constructing TF-IDF vectors of all documents in the text data set and forming a TF-IDF vector matrix;
A ball tree construction unit, configured to recursively group each vector in a TF-IDF vector matrix into nested ball trees using a ball tree construction algorithm, to obtain a ball tree index space in which the TF-IDF vector is nested;
The clustering unit is used for performing density-based clustering on all data points based on the ball tree index space to obtain a plurality of clustering clusters;
the identification code distribution unit is used for distributing a unique identification code for each cluster and replacing all data points contained in the cluster with the identification codes;
a candidate item set construction unit, configured to construct a candidate 1-item set by using a cluster as an item according to the unique identification code;
The support degree calculation unit is used for calculating the support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and comparing the support degree with a preset first support degree threshold value, wherein the transaction library comprises a plurality of transactions, and the transactions consist of one or more items;
The item set screening unit is used for screening out items which are not smaller than the first support threshold value to obtain frequent 1-item sets, wherein each item in the frequent 1-item sets consists of a cluster;
the frequent item set construction unit is used for constructing frequent n-item sets in the same manner until no frequent (n+1)-item set can be constructed;
and the association rule output unit is used for determining, based on all the generated frequent item sets, the items meeting a preset confidence and outputting association rules.
A third aspect of the present application provides a multidimensional data processing analysis system, the system comprising:
a processor, a memory, an input-output unit, and a bus;
The processor is connected with the memory, the input/output unit and the bus;
the memory holds a program that the processor invokes to perform the method of the first aspect or any of its optional implementations.
A fourth aspect of the application provides a computer readable storage medium having stored thereon a program which, when executed on a computer, performs the method of the first aspect or any of its optional implementations.
From the above technical scheme, the application has the following advantages:
1. The method has the capability of processing multi-source heterogeneous business data, and can effectively process both structured and unstructured data. This comprehensiveness allows for the features and information of different types of data to be fully considered during the data analysis process, thereby more fully extracting valuable features.
2. By extracting features from unstructured service data and storing the extracted feature data together with the structured service data, the method makes full use of text data for subsequent analysis and mining. The feature extraction and storage process makes the data more standardized and operable, providing a basis for subsequent data processing and analysis.
3. The process of converting text data into a TF-IDF vector matrix allows the text data to be numerically processed, which can be processed using a wider range of data analysis techniques and algorithms. The TF-IDF vector can accurately reflect the importance of each word in the text data, and provides a basis for subsequent data analysis.
4. The TF-IDF vector is organized and indexed by adopting a ball tree construction algorithm, which is helpful to improve the retrieval efficiency and clustering effect of the data. Especially for high-dimensional data, the ball tree can effectively process the data, and support is provided for subsequent data clustering and analysis.
5. The method adopts a density-based clustering algorithm that can automatically identify the shape and size of clusters without presetting the number of clusters. This flexibility and adaptivity make the clustering results more accurate and better able to reflect the actual condition of the data.
6. By mining frequent item sets and association rules, hidden association relationships and rules in the data can be found. This mining process provides a deeper level of insight and guidance for business decisions, helping to discover potential values and opportunities in the data.
7. The method has higher flexibility and expandability, and can adapt to data of different scales and types. The algorithm has clear steps, is easy to understand and realize, can be applied to various scenes and industries, and has strong universality and applicability.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an embodiment of a multi-dimensional data processing analysis method provided in the present application;
FIG. 2 is a schematic flow chart of an embodiment of performing density-based clustering in the multi-dimensional data processing and analysis method provided by the present application;
FIG. 3 is a flowchart illustrating an embodiment of constructing a ball tree index space in the multidimensional data processing and analysis method provided by the present application;
FIG. 4 is a flow chart of an embodiment of determining a first segmentation value when constructing a ball tree index space in the multi-dimensional data processing analysis method provided by the present application;
FIG. 5 is a schematic flow chart of an embodiment of extracting clusters from an output list in the multidimensional data processing and analysis method provided by the present application;
FIG. 6 is a flow chart of another embodiment of a multi-dimensional data processing analysis method provided by the present application;
FIG. 7 is a schematic diagram of a tree data structure constructed in the multidimensional data processing and analysis method provided by the present application;
FIG. 8 is a flow chart of a method for multidimensional data processing analysis according to another embodiment of the present application;
FIG. 9 is a schematic diagram of a multi-dimensional data processing and analyzing device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a multidimensional data processing analysis system according to an embodiment of the present application.
Detailed Description
The method provided by the application can be applied to a plurality of technical fields, not only optimizes the processing efficiency of the whole algorithm, but also has various technical effects, such as:
1. intelligent recommendation system
In an intelligent recommendation system, the technical scheme of the application can play an important role. By comprehensively processing structured and unstructured data generated by users (e.g., browsing records, search history, comments, social media interactions, etc.), the system is able to more accurately understand the needs and preferences of users. The application of the deep learning algorithm enables the system to extract more valuable features from unstructured data, and further improves recommendation accuracy. Meanwhile, by combining with an optimized association rule learning and clustering algorithm, the system can find potential modes and associations in user behaviors and provide more personalized and accurate recommendation for users.
2. Enterprise data analysis and decision support
In the field of enterprise data analysis, the technical scheme of the application can help enterprises to understand the business data more deeply. By processing multi-source data including financial reports, sales records, customer feedback, market trends and the like, the scheme can reveal internal rules and associations in the data and provide powerful support for enterprise decision-making. For example, through cluster analysis, an enterprise may divide clients into different groups and formulate more accurate marketing strategies for the different groups. In addition, the optimized association rule learning algorithm can also help enterprises find sales associations between products, optimize inventory management and product combinations.
3. Social media analysis and public opinion monitoring
In the fields of social media analysis and public opinion monitoring, the technical scheme of the application has wide application. A great deal of unstructured data, including text, images, video, etc., is generated on social media platforms, and it contains rich information and user views. Through the processing and analysis of the scheme, topic trends, user sentiment trends, the influence of key opinion leaders, and the like on social media can be monitored in real time. This is of great significance for brand reputation management, crisis response, and market trend prediction.
4. Smart city and traffic planning
In the field of smart cities and traffic planning, the technical scheme of the application can be applied to analysis and processing of urban big data. By integrating multi-source data including traffic flow, environment monitoring, city planning and the like, the scheme can help city planners to more comprehensively know city running conditions and development trends. The optimized association rule learning algorithm can reveal association and influence mechanisms among different city elements, and provides scientific basis for city planning. Meanwhile, cluster analysis can also help to identify hot spot areas and traffic bottlenecks in cities, and powerful support is provided for traffic planning and optimization.
The following detailed description of specific embodiments of the application is provided:
referring to fig. 1, fig. 1 is a flow chart of an embodiment of a method provided in the present application, where the embodiment includes:
s101, acquiring multi-source heterogeneous service data through a pre-configured data interface, wherein the multi-source heterogeneous service data comprises structured service data and unstructured service data;
In the embodiment of the application, heterogeneous service data is acquired from a plurality of sources through a pre-configured data interface. The data may come from different databases, file systems, APIs, etc. These data include structured data (e.g., data in database tables) and unstructured data (e.g., text, images, etc.).
First, a data source needs to be determined, such as a relational database (e.g., MySQL, PostgreSQL, etc.), a NoSQL database (e.g., MongoDB, Cassandra, etc.), a file system (e.g., HDFS, NFS, etc.), an API interface, or another data source (e.g., Kafka stream data, Internet of Things sensor data, etc.).
For each data source, a specific data interface is designed. For example, for relational databases, JDBC or ODBC interfaces may be used; for a NoSQL database, the specific APIs it provides can be used; for file systems, file I/O operations, etc. may be used.
If the data is acquired through the API, the calling mode, the request parameters, the return format and other information of the API are required to be configured, and corresponding client codes are written to call the API.
According to different data sources, configuring corresponding connection information such as URL, user name and password of a database; an endpoint of an API, authentication information, etc. Such configuration information may be stored in a configuration file or in an environment variable for reference in the code.
Code is written to connect to the data sources and query or read operations are performed to retrieve the data. For structured data, it is necessary to write SQL queries or query functions using database client libraries. For unstructured data, such as text or image files, file read logic needs to be written, and specific libraries can also be used to parse the file content (e.g., using NLP libraries to parse text data).
The raw data obtained may be subjected to some preprocessing, such as format conversion, missing value filling, outlier processing, etc. For unstructured data, such as text, text processing steps such as word segmentation, stop word removal, stem extraction, etc. may also be required. The processed data is stored to the target location, which may be a central database, data warehouse, data lake or other storage system.
Optimization measures such as data partitioning and indexing may need to be considered during storage to improve the efficiency of subsequent data processing. Timed tasks can be set, or data acquisition and processing flows can be automated using stream processing tools (e.g., Apache Flink, Apache Beam, etc.).
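For illustration only, a minimal sketch of such configured data interfaces, with sqlite3 standing in for a relational driver; the configuration file name, table name and endpoint key are hypothetical:

import json
import sqlite3
import urllib.request

config = json.load(open('sources.json'))  # hypothetical config holding connection info

# structured service data from a relational source
conn = sqlite3.connect(config['db_path'])
rows = conn.execute('SELECT * FROM sales_records').fetchall()

# unstructured service data pulled from a hypothetical HTTP API endpoint
with urllib.request.urlopen(config['api_endpoint']) as resp:
    documents = json.loads(resp.read())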
S102, extracting features from the unstructured service data, storing the extracted features, and storing the extracted features and the structured service data together into a preset data structure to form a text data set;
In this step, feature extraction is performed on unstructured data, for example, extracting keywords from text, extracting visual features from images, and the like. The extracted feature data is stored in a predetermined data structure together with the structured data to form a text data set.
In this step, unstructured data and structured data may be processed, and detailed implementation procedures are described below:
1. unstructured data processing
1.1, Extracting text data characteristics:
Word segmentation: for Chinese text, word segmentation is first required to split sentences into individual words. Chinese word segmentation tools such as jieba, THULAC, etc. may be used (a minimal sketch follows this list).
Removing stop words: stop words refer to words that frequently occur in text but do not contribute much to the meaning of the text, such as "yes", "in", etc. These words are typically removed.
Keyword extraction: the keywords in the text may be extracted using TF-IDF, textRank, or other algorithms.
Word vector representation: the text is converted into a vector form, so that subsequent mathematical operation is facilitated. A common method is Word2Vec, gloVe, fastText, etc.
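A minimal sketch of the segmentation and stop-word steps above, using the third-party jieba library; the stop-word list is illustrative only:

import jieba  # third-party Chinese word-segmentation library

stopwords = {'的', '是', '在'}  # illustrative stop-word list

def tokenize(text):
    # segment a Chinese sentence and drop stop words
    return [w for w in jieba.lcut(text) if w not in stopwords]

print(tokenize('多维数据处理分析方法'))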
1.2, Image data feature extraction
Pretreatment: the image is scaled, cropped, normalized, etc. to accommodate the feature extraction algorithm.
Feature extraction: features are extracted from the image using a deep learning model (e.g., CNN) or conventional computer vision methods (e.g., SIFT, SURF).
Vector representation: the extracted features are converted into vector form, so that the subsequent processing is facilitated.
2. Structured data processing
Structured data is typically already in tabular form, containing well-defined fields and values. But some preprocessing work such as data cleaning, missing value filling, outlier handling, etc. may also be performed.
3. Integration of feature data with structured data
Alignment of data: the unstructured data and the structured data are logically matched to each other. For example, if text or image data is associated with a particular record, that association should be preserved during the integration process.
Selecting a data structure: an appropriate data structure is selected to store the integrated data. This may be a table of a relational database or a structured data file (e.g., CSV, JSON, etc.).
Data insertion: the processed unstructured data features (e.g., text keywords, image feature vectors) are inserted into the selected data structure along with the structured data.
An index may also be created for the integrated dataset to improve the efficiency of subsequent queries and processing.
S103, traversing the text data set, constructing TF-IDF vectors of all documents in the text data set, and forming a TF-IDF vector matrix;
In this step, the text dataset is traversed and a TF-IDF (word frequency-inverse document frequency) vector is constructed for each document. This vector represents the importance of each word in the document. The TF-IDF vectors of all documents are combined into a TF-IDF vector matrix.
In this embodiment, a TF-IDF (Term Frequency-Inverse Document Frequency) vector of text needs to be constructed. The following is an example of one specific implementation of how to construct a TF-IDF vector for each document and combine these vectors into a TF-IDF matrix:
The text is first preprocessed: each document is segmented into separate words or phrases, common words that carry little meaning (such as "yes", "in", etc.) are removed, and each word is converted into its base form. These preprocessing operations may also be completed in step S102.
A Term Frequency (TF) is calculated, and for each document, the number of times each word appears in the document (Term Frequency) is calculated. Word frequency (t, D) is expressed as the number of times word t occurs in a given document D divided by the total number of words in document D.
An inverse document frequency (IDF, inverse Document Frequency) is calculated, which is used to measure the importance of a word in all document sets. It is defined as the logarithm of the ratio of the total number of documents to the number of documents containing a particular vocabulary. IDF (t) =log_e (n_doc/df (t)), where n_doc represents the total number of documents in the text dataset and df (t) represents the number of documents containing the vocabulary t, log_e being a logarithmic function based on the natural constant e.
For each word in each document, its TF-IDF value is calculated. The TF-IDF value is the product of the word frequency and the inverse document frequency: TF-IDF (t, D) =tf (t, D) ×idf (t).
And finally, constructing a TF-IDF matrix, combining the TF-IDF values of each document into a vector, and combining the vectors of all the documents into a matrix, wherein rows represent the documents, columns represent the vocabulary, and the value of each cell is the TF-IDF value of the corresponding word in the corresponding document.
A more specific implementation example is provided below. In Python, the TF-IDF matrix can be calculated using the scikit-learn library:
# Example of TF-IDF calculation code
from sklearn.feature_extraction.text import TfidfVectorizer

# Suppose documents is a list of documents, each document being a string
documents = [
    'First document.',
    'Second document.',
    # ... other documents
]

# Initialize TfidfVectorizer; a whitespace tokenizer keeps pre-segmented words intact
vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), lowercase=False)

# Calculate the TF-IDF matrix using TfidfVectorizer
tfidf_matrix = vectorizer.fit_transform(documents)

# Output the TF-IDF matrix
print(tfidf_matrix.toarray())

# Obtain the feature names (i.e., the vocabulary)
print(vectorizer.get_feature_names_out())
In practical applications, the tokenizer may be customized to meet the word segmentation requirements of Chinese text, or an existing Chinese word segmentation library (e.g., jieba) may be used. In addition, lowercase=False in the above code example keeps the Chinese vocabulary as it is, since letter case is not an issue for Chinese. In this way a TF-IDF matrix is obtained, where each row represents the TF-IDF vector of a document, which can be used for subsequent text clustering.
S104, recursively grouping each vector in the TF-IDF vector matrix into a nested ball tree by using a ball tree construction algorithm to obtain a ball tree index space in which the TF-IDF vector is nested;
In the present application, individual vectors in the TF-IDF vector matrix are recursively grouped into nested ball trees using a ball tree construction algorithm. A spherical tree is a data structure for fast approximate nearest neighbor searching in a high-dimensional space.
The specific implementation process can be as follows:
One vector is selected from the TF-IDF vector matrix as an initial point. The initial point is taken as the root node, and a sphere (or hypersphere) is constructed to contain it. A point around the root node is selected as a cut point, and the dataset is then split into two parts according to the cut point: one part of the points lies inside the sphere centered on the cut point, and the other part lies outside it. The splitting step is repeated for each subset until a stop condition is met. The stop condition may be reaching a maximum depth, the number of data points in a node falling below a certain threshold, or another predefined condition. Each node is connected with its child nodes to form a ball tree.
More specifically, an appropriate cutting point, a radius of the ball, a stop condition, and the like can be determined. These choices will affect the structure and performance of the ball tree. By recursively dividing the data into smaller subsets, the ball tree is able to quickly conduct nearest neighbor searches in high-dimensional space, thereby providing support for subsequent cluster analysis.
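A minimal sketch of this step follows, using the ball tree implementation in scikit-learn as a stand-in for the construction algorithm described above; the random matrix merely stands in for the TF-IDF vector matrix of step S103:
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
tfidf_dense = rng.random((100, 20))  # stand-in for tfidf_matrix.toarray()

# leaf_size plays the role of the stop condition: nodes stop splitting
# once they contain fewer points than this threshold
tree = BallTree(tfidf_dense, leaf_size=10)

# Nearest-neighbor query for the first document
dist, ind = tree.query(tfidf_dense[:1], k=5)
print(ind)  # indices of the 5 documents closest to document 0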
S105, based on the ball tree index space, clustering all data points based on density to obtain a plurality of clusters;
A density-based clustering algorithm is performed on all data points based on the ball tree index space. The purpose of clustering is to group data points such that the data points within the same group (cluster) are highly similar, while the data points of different groups are less similar.
In this step, a density-based clustering algorithm is performed on the basis of the sphere tree index space, and the structure of the sphere tree can be utilized to accelerate neighbor search, thereby improving the clustering efficiency. The following is a specific implementation of performing a density-based clustering algorithm on the spherical tree index space:
First a ball tree index needs to be constructed from the dataset. A Ball Tree is a space-partitioning data structure for fast neighbor searching that organizes data points by recursively partitioning the data space into a series of nested hyperspheres.
Appropriate density thresholds (e.g., eps, i.e., neighborhood radius) and minimum points (e.g., minPts) are selected.
Starting from the root node of the ball tree, nodes that clearly lie outside the Eps neighborhood of the current point are rapidly eliminated using the properties of the ball tree. For each data point, all points within its Eps neighborhood are quickly found via the ball tree, and it is judged whether the current point is a core point (i.e., whether at least MinPts points exist within its Eps neighborhood).
If a point is a core point, a new cluster is created and all points in its Eps neighborhood (including core points and boundary points) are added to the cluster. For newly added points that are themselves core points, the cluster continues to be expanded until no new points can be added. Boundary points are points that lie in the Eps neighborhood of a core point but whose own Eps neighborhood contains fewer than MinPts points; they are assigned to the cluster of the nearest core point. Noise points are points that are neither core points nor boundary points; they are treated as isolated points during clustering and belong to no cluster. In some cases, clusters that are very close to each other may be merged. The final output is a cluster set: a series of clusters, each containing a group of highly similar data points, together with the data identified as noise points.
When the ball tree index is used for density-based clustering, the ball tree can rapidly exclude data far away from the query point, so that the number of data points needing to be examined in detail is greatly reduced, and the clustering efficiency is improved.
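A minimal sketch follows, assuming scikit-learn's DBSCAN as the density-based algorithm; forcing algorithm='ball_tree' makes the neighborhood queries run over a ball tree index as described above, and eps and min_samples are example values:
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
tfidf_dense = rng.random((100, 20))  # stand-in for the TF-IDF vector matrix

# Density-based clustering whose Eps-neighborhood queries use a ball tree
clustering = DBSCAN(eps=0.9, min_samples=5, algorithm='ball_tree').fit(tfidf_dense)

# labels_[i] is the cluster of document i; -1 marks noise points
print(clustering.labels_)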
While one embodiment of the present application is provided above, other implementations are also possible.
S106, distributing a unique identification code for each cluster, and replacing all data points contained in the cluster with the identification codes;
In this step, each cluster is assigned a unique identification code. All data points in the cluster are replaced with this identification code. The purpose of assigning identification codes to clusters and replacing original data points is to enable data analysis and mining to be performed subsequently in units of clusters, including association rule learning. By aggregating similar data points into a cluster and representing with a unique identification code, the data set can be greatly simplified, making the subsequent data mining algorithm more efficient.
One specific implementation is provided below:
An integer variable or counter is created for generating a unique cluster identification code. It may start at 1 and increment each time.
For each cluster, a new unique identification code is assigned.
Each data point in the data set is traversed. For each data point it is checked which cluster it belongs to (cluster information to which each data point belongs has been recorded during the clustering process).
The value of the data point (or its representation in the dataset) is replaced with the unique identification code of its cluster. Alternatively, a data list can be created to record the unique identification codes of all data points in the cluster.
To facilitate subsequent data analysis, a mapping table may be recorded that maps the unique identification code of each cluster to detailed information (e.g., center point, member point list, etc.) for that cluster.
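A minimal Python sketch of this step follows; the labels list is a hypothetical clustering output in which -1 marks noise points:
labels = [0, 0, 1, -1, 1, 2]  # hypothetical clustering output; -1 = noise

next_id = 1          # counter that starts at 1 and increments per cluster
cluster_id_map = {}  # cluster label -> unique identification code
members = {}         # identification code -> member data-point indices

for point_index, label in enumerate(labels):
    if label == -1:
        continue  # noise points are not assigned to any cluster
    if label not in cluster_id_map:
        cluster_id_map[label] = next_id
        next_id += 1
    code = cluster_id_map[label]
    members.setdefault(code, []).append(point_index)

# Replace each data point with the identification code of its cluster
encoded = [cluster_id_map.get(label) for label in labels]
print(encoded)   # [1, 1, 2, None, 2, 3]
print(members)   # mapping table for subsequent analysis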
S107, constructing a candidate 1-item set by taking a cluster as an item according to the unique identification code;
In this step, clusters are considered as "items", each cluster effectively representing a set of similar data points. To construct a candidate 1-item set (i.e., an item set containing only one item), all independent cluster identification codes need to be listed, since each cluster is subsequently treated as a separate item.
All clusters are traversed and the identification code of each cluster is added as a separate item to the candidate 1-item set. In association rule mining, a set of items is a collection of items, while in the present application, each "item" is an identification code of a cluster. These candidate 1-item sets may be stored using a list, collection, or other data structure. For example, in Python, a simple list may be used to store the items.
An example of code for one such implementation is provided below:
# Suppose cluster_ids is a list containing all cluster identification codes
cluster_ids = [1, 2, 3, 4, 5]

# The candidate 1-item set is a list containing all cluster identification codes
candidate_1_itemsets = cluster_ids.copy()  # or use cluster_ids directly

# Output the candidate 1-item set
print('Candidate 1-item set:')
for item in candidate_1_itemsets:
    print(item)
In this example, candidate_1_itemsets is the candidate 1-item set, in which each element represents a cluster (i.e., a collection of data points). In the subsequent association rule mining process, frequent item sets are found and association rules are generated based on these candidate item sets.
It should be noted that item sets and items are the basic units of association rule mining and differ from the clusters and data points of the previous steps. In the present application, each cluster is treated as one item for association rule mining.
S108, calculating the support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and comparing the support degree with a preset first support degree threshold value, wherein the transaction library comprises a plurality of transactions, and the transactions consist of one or more items;
The support of each item in the candidate 1-item set is calculated over the pre-constructed transaction library. The transaction library contains a plurality of transactions, each consisting of one or more items (clusters). The support is the proportion of all transactions that contain a given item.
The following provides a calculation method of the support degree:
The support of each item in the candidate 1-item set over the transactions is calculated as follows:
support(item_i) = count(item_i) / N;
where item_i is an item contained in the candidate 1-item set, count(item_i) is the number of times the item occurs across all transactions, and N is the total number of transactions; support(item_i) is the support of item item_i.
S109, screening out items not smaller than the first support threshold value to obtain a frequent 1-item set, wherein each item in the frequent 1-item set is composed of a cluster;
Frequent item sets are those item sets whose support meets a given threshold. In step S109, the items whose support is not less than the preset first support threshold are screened out to form the frequent 1-item set. First, the first support threshold is determined; then, for each item in the candidate 1-item set, its support is compared with that threshold. If the support of an item is greater than or equal to the first support threshold, it is added to the frequent 1-item set. All the screened items constitute the frequent 1-item set.
The process of building the frequent 1-item set is illustrated by the following specific example:
Assume a small supermarket transaction database recording the shopping lists of 5 customers, where each transaction represents the shopping record of one customer:
Transaction 1: bread, milk
Transaction 2: bread, diaper, beer
Transaction 3: milk, diaper, peanut
Transaction 4: bread, milk, diaper, peanut
Transaction 5: diaper, peanut, toothpaste
For the present application, each item (commodity) in a transaction represents a cluster. First, the support of each item in all transactions needs to be calculated; the support is the ratio of the number of transactions containing the item to the total number of transactions. The total number of transactions is 5.
Next, the number of occurrences of each item across the transactions is counted:
Bread: appears in transactions 1, 2, 4 (3 times in total)
Milk: appears in transactions 1, 3, 4 (3 times in total)
Diaper: appears in transactions 2, 3, 4, 5 (4 times in total)
Beer: appears in transaction 2 (1 time in total)
Peanut: appears in transactions 3, 4, 5 (3 times in total)
Toothpaste: appears in transaction 5 (1 time in total)
Now, the support degree of each item is calculated:
Bread: support 3/5 = 0.6
Milk: support 3/5 = 0.6
Diaper: support 4/5 = 0.8
Beer: support 1/5 = 0.2
Peanut: support 3/5 = 0.6
Toothpaste: support 1/5 = 0.2
Setting a minimum support threshold and screening frequent 1-item sets;
Suppose the minimum support threshold is set to 0.4 (meaning an item must occur in at least 40% of transactions to be considered "frequent").
According to the threshold, the items with the support degree not lower than 0.4 are screened out to form frequent 1-item sets:
Bread: support 0.6, frequent
Milk: support 0.6, frequent
Diaper: support 0.8, frequent
Beer: support 0.2, not frequent
Peanut: support 0.6, frequent
Toothpaste: support 0.2, not frequent
Thus, the frequent 1-item sets are: {bread}, {milk}, {diaper}, {peanut}.
These items will be the basis for building larger frequent item sets (e.g., frequent 2-item sets, frequent 3-item sets, etc.).
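As a minimal Python sketch of steps S108 and S109 over this worked example (the transactions and threshold are those given above):
transactions = [
    {'bread', 'milk'},
    {'bread', 'diaper', 'beer'},
    {'milk', 'diaper', 'peanut'},
    {'bread', 'milk', 'diaper', 'peanut'},
    {'diaper', 'peanut', 'toothpaste'},
]
min_support = 0.4
n = len(transactions)

# Count the occurrences of each item across all transactions
counts = {}
for t in transactions:
    for item in t:
        counts[item] = counts.get(item, 0) + 1

# support(item) = count(item) / N, then screen by the threshold
support = {item: c / n for item, c in counts.items()}
frequent_1 = {item for item, s in support.items() if s >= min_support}
print(frequent_1)  # {'bread', 'milk', 'diaper', 'peanut'}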
S110, constructing a frequent n-item set based on the mode until the frequent n+1 item set cannot be constructed;
Following the manner of steps S108 and S109, frequent n-item sets, i.e., sets of n items whose support in the transaction library is not lower than the preset support threshold, are built level by level. This process continues until a frequent (n+1)-item set cannot be constructed.
Building on the example of constructing the frequent 1-item set in step S109, the procedure of constructing frequent n-item sets is further described below through the construction of the frequent 2-item sets:
First, candidate 2-item sets are generated from the frequent 1-item sets constructed in step S109. The frequent 1-item sets are: {bread}, {milk}, {diaper}, {peanut}.
By pairwise combining these frequent 1-item sets, the following candidate 2-item sets are obtained:
{bread, milk}
{bread, diaper}
{bread, peanut}
{milk, diaper}
{milk, peanut}
{diaper, peanut}
Next, the support of each candidate 2-item set in the transaction database needs to be calculated. Here, the support is the ratio of the number of transactions containing the item set to the total number of transactions.
From the previous transaction database:
Transaction 1: bread, milk
Transaction 2: bread, diaper, beer
Transaction 3: milk, diaper, peanut
Transaction 4: bread, milk, diaper, peanut
Transaction 5: diaper, peanut, toothpaste
Counting the occurrence times of each candidate 2-item set:
{bread, milk}: appears in transactions 1 and 4 (2 times in total)
{bread, diaper}: appears in transactions 2 and 4 (2 times in total)
{bread, peanut}: appears only in transaction 4 (1 time in total)
{milk, diaper}: appears in transactions 3 and 4 (2 times in total)
{milk, peanut}: appears in transactions 3 and 4 (2 times in total)
{diaper, peanut}: appears in transactions 3, 4 and 5 (3 times in total)
Now, the support for each candidate 2-item set is calculated (total transaction number 5):
{bread, milk}: support 2/5 = 0.4
{bread, diaper}: support 2/5 = 0.4
{bread, peanut}: support 1/5 = 0.2
{milk, diaper}: support 2/5 = 0.4
{milk, peanut}: support 2/5 = 0.4
{diaper, peanut}: support 3/5 = 0.6
Continuing with the minimum support threshold of 0.4 set earlier, the candidate 2-item sets whose support is not lower than 0.4 are screened out to form the frequent 2-item sets:
{bread, milk}: support 0.4, frequent;
{bread, diaper}: support 0.4, frequent;
{milk, diaper}: support 0.4, frequent;
{milk, peanut}: support 0.4, frequent;
{diaper, peanut}: support 0.6, frequent;
Only {bread, peanut} falls below the threshold and is therefore not considered frequent.
Therefore, the frequent 2-item sets include: {bread, milk}, {bread, diaper}, {milk, diaper}, {milk, peanut}, {diaper, peanut}. These frequent 2-item sets may serve as the basis for mining larger frequent item sets or for generating association rules.
In this way, frequent n-item sets are constructed until a frequent (n+1)-item set can no longer be constructed.
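A minimal sketch of this level-wise construction follows; the helper names are assumptions, and the join simply unions pairs of frequent n-item sets into (n+1)-sized candidates before screening them by support:
from itertools import combinations

def itemset_support(itemset, transactions):
    # Fraction of transactions that contain every item of the item set
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def next_frequent(frequent_n, transactions, min_support):
    size = len(next(iter(frequent_n))) + 1
    # Join step: union pairs of frequent n-item sets, keep (n+1)-sized results
    candidates = {a | b for a, b in combinations(frequent_n, 2) if len(a | b) == size}
    # Screening step: keep candidates whose support meets the threshold
    return {c for c in candidates if itemset_support(c, transactions) >= min_support}

transactions = [
    {'bread', 'milk'},
    {'bread', 'diaper', 'beer'},
    {'milk', 'diaper', 'peanut'},
    {'bread', 'milk', 'diaper', 'peanut'},
    {'diaper', 'peanut', 'toothpaste'},
]
frequent_1 = {frozenset({x}) for x in ('bread', 'milk', 'diaper', 'peanut')}
frequent_2 = next_frequent(frequent_1, transactions, min_support=0.4)
print(frequent_2)  # the five frequent 2-item sets listed above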
S111, determining the items conforming to the preset confidence coefficient based on all the generated frequent item sets, and outputting the association rule.
In the step, based on all the generated frequent item sets, the items meeting the preset confidence level are determined. Confidence is a conditional probability that indicates the probability that result Y is also contained in a transaction that contains condition X. And outputting an association rule conforming to the preset confidence. These rules may be used in scenes such as recommendation systems, market analysis, etc.
For each frequent item set, multiple association rules may be generated. For example, for the frequent item set { bread, milk }, possible association rules are:
If bread is purchased, milk may also be purchased (bread -> milk)
If milk is purchased, bread may also be purchased (milk -> bread)
For each possible association rule, its confidence level needs to be calculated. Confidence is a conditional probability defined as:
confidence(X -> Y) = support(X, Y) / support(X);
In the association rule X -> Y, X is the condition and Y is the result; the rule states that when condition X appears, result Y also tends to appear. support(X, Y) denotes the support of the frequent item set (X, Y), and support(X) denotes the support of the frequent item set X.
Thus, a set of association rules may be generated from the frequent item sets, the set comprising a plurality of association rules. After the confidence of each association rule is calculated, it is compared with a preset confidence threshold, and the association rules meeting the confidence requirement are screened out and output, where the confidence requirement is that the confidence is not smaller than the confidence threshold.
The mining of association rules is further described by way of an example:
Assume the following frequent item sets and their support:
{bread, milk}: support 0.4
{diaper, peanut}: support 0.3
{bread, diaper, milk}: support 0.2
And the minimum confidence threshold is set to 0.6.
For the frequent item set { bread, milk }, two association rules may be generated:
bread -> milk:
confidence = support(bread, milk) / support(bread) = 0.4 / support(bread)
Assuming support(bread) = 0.6, confidence = 0.4 / 0.6 ≈ 0.67 > 0.6 (the minimum confidence is met)
milk -> bread:
confidence = support(bread, milk) / support(milk) = 0.4 / support(milk)
Assuming support(milk) = 0.7, confidence = 0.4 / 0.7 ≈ 0.57 < 0.6 (the minimum confidence is not met);
In this example, only the confidence level of the association rule "bread- > milk" exceeds the threshold, so the association rule is output.
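A minimal sketch of this screening follows; the support values are the assumed figures from the example above rather than quantities computed from data:
min_confidence = 0.6
support = {
    frozenset({'bread', 'milk'}): 0.4,  # assumed support values from the example
    frozenset({'bread'}): 0.6,
    frozenset({'milk'}): 0.7,
}

def confidence(x, y):
    # confidence(X -> Y) = support(X, Y) / support(X)
    return support[x | y] / support[x]

bread, milk = frozenset({'bread'}), frozenset({'milk'})
for x, y, name in [(bread, milk, 'bread -> milk'), (milk, bread, 'milk -> bread')]:
    c = confidence(x, y)
    print(name, round(c, 2), 'output' if c >= min_confidence else 'discard')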
The method provided by the application can be applied to various scenes, for example, by analyzing the shopping records of customers, which commodities are purchased together frequently can be found, which is very important for the shelf arrangement of supermarkets, promotion strategies and the construction of recommendation systems. For example, if it is found that customers often purchase bread and milk at the same time, a supermarket may put the two items together or make a binded sale.
In a social network, users' behaviors and interests may be analyzed through the content they post, their likes, comments, and other behaviors. The method can be used to discover common interests and associations among user groups for accurate advertisement delivery or community discovery. The method can also be applied to network security monitoring: in the field of network security, a large amount of log data needs to be effectively analyzed and mined, and the method can help identify abnormal behavior patterns and discover network attacks and intrusions in time. The method can likewise be used in text mining and sentiment analysis: for large amounts of text data such as news, comments, and social media posts, it can extract topics, discover trends, and analyze public sentiment tendencies.
The method provided by the present embodiment may help to reduce a large number of detailed data points to fewer, higher-level categories by clustering. This not only reduces the computational complexity of the subsequent association rule mining, but also makes the analysis more intuitive and easy to understand. During the clustering process, some outlier noise data or outliers may be identified and eliminated, thereby improving the accuracy and effectiveness of association rule mining. By clustering, hidden, meaningful clusters or patterns in the data can be found. These groups can serve as the basis for association rule mining to help identify associations and dependencies between different groups. The association rule mining is carried out on the basis of clustering, so that interesting and practical association rules are more likely to be found. Because these rules are generated based on natural groupings and patterns in the data, they are more likely to reflect true, meaningful relationships. Because the dimension and complexity of the data are reduced by clustering, the association rule mining algorithm can be run on smaller data sets, so that the calculation efficiency and the response speed are improved. Clustering results are generally easier to interpret and understand because they are grouped based on similarity of data. This makes the association rules mined on a clustered basis more interpretable, helping the decision maker to better understand and apply these rules.
In the embodiment of the application, the neighborhood checking process is innovatively optimized by constructing a Ball Tree, which brings several advantages. The ball tree is an effective space-partitioning data structure, particularly suitable for fast neighbor search in high-dimensional data spaces. By dividing the data space into a series of nested spheres (or hyperspheres), the ball tree can quickly locate the neighborhood of a given point. This is particularly useful in association rule mining, as other items close to a particular item or item set can be found quickly, thereby speeding up the discovery of frequent item sets and association rules.
The use of a spherical tree can significantly reduce the number of data points that need to be checked when looking for association rules, as compared to a conventional full data scan. The construction of the ball tree enables the algorithm to quickly exclude those data points that are not relevant to the current query, focusing on areas that are more likely to produce meaningful associations.
In high-dimensional data space, conventional neighbor search methods may become very inefficient. The spherical tree structure is particularly suitable for handling high-dimensional data because it reduces the search space by spatial division, making neighbor searches in high-dimensional data more efficient.
The ball tree structure can flexibly adapt to the distribution and density of data. In data-dense regions, the ball tree may divide space more finely, while in data-sparse regions, a coarser division may be employed. This flexibility helps to handle unevenly distributed data in association rule mining.
For large-scale data sets, the ball tree can be constructed and used in a hierarchical manner, so that good expandability is realized. This means that the ball tree can maintain efficient search performance even in the case where the data amount is greatly increased.
In step S105, it is mentioned that, in performing density-based clustering on all data points, a plurality of clusters may be obtained, and a specific implementation has been given in step S105, and another alternative implementation is provided below, which is specifically as follows:
S201, determining a preset neighborhood radius Eps and a minimum point number MinPts;
Before starting clustering, two important parameters need to be preset: neighborhood radius Eps and minimum point number MinPts. Eps defines a neighborhood of data points, while MinPts defines the minimum number of neighbors that a point needs to be considered as a core object. The selection of these two parameters has an important impact on the clustering result.
S202, for any data point in the ball tree index space, determining the number of data points contained in its Eps neighborhood by utilizing the ball tree index space;
This step uses the ball tree index space to efficiently determine, for each data point, the number of other data points contained in its Eps neighborhood. A ball tree is a data structure that can quickly retrieve the neighbors of a given point, thereby speeding up the neighborhood query process.
First, a ball tree is constructed using data points in the TF-IDF vector matrix. The data space may be recursively partitioned into spheres containing nearby points until the number of points in each sphere is less than a preset threshold or the maximum depth of the tree is reached.
Each sphere is defined by a center point, which may be the mean or median of all points in the sphere, and a radius, which is the maximum distance of the points from the center point.
For each data point, a ball tree is used to query for points within its EPS field. This may be achieved by recursively traversing the ball tree starting from the root node.
During traversal, for the current node (sphere): if the distance from the query point to the node's center plus the node's radius is not greater than Eps, every point in the node lies within the Eps neighborhood of the query point.
If the distance from the query point to the node's center minus the node's radius is greater than Eps, the node and all its children can be ignored, because none of their points can lie within the Eps neighborhood of the query point.
In all other cases, the child nodes need to be checked recursively.
For each query point, the number of points within its Eps neighborhood is calculated and recorded. This can be achieved by incrementing a counter during traversal. Specific information about these neighboring points may also be recorded for subsequent processing.
S203, if the number of data points contained in the Eps neighborhood of any data point is not less than MinPts, determining the data point as a core object;
If the Eps neighborhood of a data point contains no fewer than MinPts points, the point is regarded as a core object. Core objects are the points from which clusters can be expanded.
S204, creating an empty output list for storing output data points;
An empty output list needs to be created to store the processed data points before the clustering process begins. This list will be populated step by step in subsequent steps.
S205, creating an empty priority queue;
The priority of the priority queue is arranged based on a reachable distance, where the reachable distance represents the distance from a data point to the nearest core object; if a data point has not been processed, its reachable distance is set to infinity;
In this embodiment, a priority queue is used to store the data points to be processed and order them according to their reachable distance to the nearest core object. The reachable distance represents the distance of a data point to its nearest core object. If a data point has not been processed, then its reachable distance is set to infinity.
S206, randomly selecting an unprocessed target point p from the ball tree index space, marking it as processed, and adding it to the output list;
This step is the beginning of the clustering process, where one data point that has not been processed is randomly selected as a starting point and marked as processed and then added to the output list. This starting point will act as a seed point for the cluster.
S207, traversing the whole ball tree index space until all data points are marked as processed by the following steps:
This step is the core of the clustering process, which finds and expands clusters by traversing the entire ball tree index space. This process continues until all data points are marked as processed.
S207a: if the target point p is a core object, determining, by means of the ball tree index space, the distance from the target point p to each unprocessed data point in its Eps neighborhood, adding those data points to the priority queue, and updating their reachable distances to the nearest core object; if a data point is already in the priority queue and its distance to the target point p is smaller than its current reachable distance, updating that data point's distance in the priority queue;
In this step, it is first determined whether the target point p is a core object (i.e., whether the number of data points contained in its Eps neighborhood is not less than MinPts); if so, the following is performed.
For each data point q within the Eps neighborhood of the target point p, it is checked whether q has been processed.
If the data point q has not been processed, the distance from q to p is calculated and taken as the reachable distance of q. q is then added to the priority queue and ordered by reachable distance. If q is already in the priority queue and the new reachable distance is smaller than the reachable distance recorded in the queue, the reachable distance of q is updated. After all data points in the Eps neighborhood of p have been processed, p is marked as visited, ensuring it is not processed again.
S207b: taking out the data point with the smallest reachable distance from the priority queue, marking the data point as processed, and adding the data point into the output list;
The data point r with the smallest reachable distance is taken out of the priority queue, marked as processed, and added to the output list. This ensures that processed points are not handled repeatedly.
S207c: judging whether the extracted data point is a core object or not;
It is checked whether the data point r fetched from the priority queue is a core object. This is determined by querying the number of data points contained within the Eps neighborhood of r.
If r is a core object, step S207a is repeated: the unprocessed data points in the Eps neighborhood of r are processed and the current cluster is expanded.
If r is not a core object (i.e., it is a boundary point or noise point), the cluster is not expanded, and the process continues with the next step.
S207d: judging whether the priority queue is empty or not;
Check whether the priority queue is empty. If it is empty, there are no more data points to process.
If the priority queue is not empty, steps S207b and S207c are repeated to process the next data point with the smallest reachable distance. If the queue is empty, the traversal process ends and the method proceeds to step S208 to extract the clusters.
In S207a, if the target point p is a core object, each unprocessed data point within its Eps neighborhood is processed, including adding it to the priority queue and updating its reachable distance. In S207b, the data point with the smallest reachable distance is fetched from the priority queue for processing, including marking it as processed and adding it to the output list. In S207c, it is judged whether the fetched data point is a core object; if so, step S207a is repeated to expand the cluster; if not (i.e., it is a boundary point or noise point), the process continues with step S207d. In S207d, it is checked whether the priority queue is empty to decide whether to continue the traversal.
S208, after the ordered output list is output, cluster extraction is carried out on the ordered list, and a cluster set formed by a plurality of clusters is obtained.
After traversing the complete data set and obtaining an ordered output list, the last step is cluster extraction from this list. This can be accomplished by dividing consecutive core objects and their boundary points into the same cluster, while isolated data points (i.e., noise points) are not contained in any cluster. The resulting cluster set is the output of this step and serves as the basis for the subsequent association rule mining.
When the method provided by the embodiment is applied to the multidimensional data processing analysis method, clustering is not performed based on distance, but is based on density, so that clusters with arbitrary shapes can be found, which cannot be achieved by distance-based algorithms such as K-Means. Unlike the conventional K-Means algorithm, the method of the present embodiment does not require a user to preset the number of clusters to be formed, and automatically determines the number of clusters according to the density of data. Traditional algorithms are very sensitive to the selection of initial centroids, which may result in completely different clustering results. The clustering result of the embodiment is not affected by the initial value, so that the result is more stable.
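This reachability-based procedure closely mirrors the OPTICS algorithm; a minimal sketch follows, assuming scikit-learn's OPTICS as a stand-in, with ball tree neighborhood queries and example parameter values:
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
tfidf_dense = rng.random((100, 20))  # stand-in for the TF-IDF vector matrix

optics = OPTICS(min_samples=5, max_eps=0.9, algorithm='ball_tree').fit(tfidf_dense)

print(optics.ordering_)                        # the ordered output list of S208
print(optics.reachability_[optics.ordering_])  # reachable distances for the scatter plot
print(optics.labels_)                          # extracted clusters (-1 = noise)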
In the method provided by the application, the neighborhood inspection process during clustering is optimized by constructing the ball tree index space. A specific embodiment for constructing the ball tree index space is provided below; referring to fig. 3, the embodiment includes:
S301, selecting a vector from the TF-IDF vector matrix as a root node of a ball tree, and taking it as an original ball tree, wherein the original ball tree comprises all vectors in the TF-IDF vector matrix;
One vector is selected from the TF-IDF vector matrix as the root node of the ball tree. This vector may be any vector in the matrix, but is preferably chosen to be representative or at the data center location.
S302, dividing vectors in the original ball tree into a first group and a second group according to a first predetermined dividing value, wherein the distance from the vectors in the first group to the root node is smaller than or equal to the first dividing value, and the distance from the vectors in the second group to the root node is larger than the first dividing value;
A segmentation value (first segmentation value) is determined, which may be set based on a data distribution or experience. Then, based on this segmentation value, the vectors in the original ball tree (which now contains only the root node and all the vectors to be segmented) are segmented into two groups. The first group contains all vectors having a distance to the root node less than or equal to the first segmentation value and the second group contains all vectors having a distance to the root node greater than the first segmentation value.
This step is to split the dataset into smaller subsets in order to recursively construct child nodes of the ball tree. Through reasonable segmentation, the relative balance of the structure of the ball tree can be ensured, so that the efficiency of subsequent query is improved.
S303, recursively constructing leaf nodes for each group of vectors;
For each set of vectors (first set and second set) formed by the previous segmentation, steps similar to S301 and S302 are recursively performed. That is, a new vector is selected from each group as the root node of the subtree, and then the group of vectors is further partitioned into two smaller groups according to another partitioning value (which may be a different value or the same value, depending on the specific partitioning strategy). This process is recursively performed until a certain termination condition is met. A multi-level ball tree structure is finally formed by recursively partitioning the vector sets and constructing subtrees. This structure can efficiently support subsequent proximity search operations.
S304, repeating the steps for each leaf node until a termination condition is met, wherein the termination condition is that the number of vectors contained in each leaf node is smaller than a preset termination threshold value; when the termination condition is satisfied, a ball tree index space is formed.
In recursively constructing leaf nodes, the number of vectors contained in each leaf node is continuously checked. The recursive process is stopped when a certain leaf node, which becomes a terminal node of the ball tree, contains a number of vectors smaller than a preset termination threshold. This process applies to all recursive paths until the entire ball tree construction is complete.
The termination condition is set to prevent the ball tree from being too deep, thereby avoiding excessive computational costs in subsequent proximity searches. By reasonably setting the termination threshold, the depth and complexity of the tree can be reduced as much as possible while the search efficiency is ensured.
When all recursive paths meet the termination condition, the entire ball tree construction process is completed. At this point, a complete ball tree index space is formed that can be used for subsequent neighbor search operations.
The formation of the ball tree index space is to improve the efficiency of the neighbor search. By pre-building such an index structure, it is possible to locate the approximate neighborhood of any one vector within a constant time, thereby greatly speeding up the neighborhood search process.
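A minimal Python sketch of the recursive construction of steps S301 to S304 follows; it assumes Euclidean distance and, for the segmentation value, uses the median of distances to the root as a simplified form of the median-projection idea described in the later embodiment:
import numpy as np

def build_ball_tree(vectors, termination_threshold=5):
    # Each node stores a center and the radius enclosing its vectors
    node = {'center': vectors.mean(axis=0)}
    node['radius'] = np.linalg.norm(vectors - node['center'], axis=1).max()
    if len(vectors) < termination_threshold:   # termination condition (S304)
        node['points'] = vectors               # leaf node keeps its vectors
        return node
    root = vectors[0]                          # root vector of this subtree (S301)
    dists = np.linalg.norm(vectors - root, axis=1)
    split_value = np.median(dists)             # segmentation value (S302)
    near = vectors[dists <= split_value]       # first group
    far = vectors[dists > split_value]         # second group
    if len(far) == 0:                          # cannot split further
        node['points'] = vectors
        return node
    node['children'] = [build_ball_tree(near, termination_threshold),
                        build_ball_tree(far, termination_threshold)]
    return node

tree = build_ball_tree(np.random.default_rng(0).random((60, 10)))
print(tree['radius'])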
In the above embodiment, in step S302, the data needs to be divided into two groups by the predetermined first division value, and the determination manner of the first division value may be various, and the present application provides a specific embodiment of the determination manner, referring to fig. 4, which includes:
S401, carrying out standardization processing on all TF-IDF vectors;
for each vector in the TF-IDF matrix, a normalization process is performed. This may include subtracting the mean (centering) and then dividing by the standard deviation (or performing other forms of scaling) such that the mean for each feature is 0 and the standard deviation is 1. The normalization is to eliminate dimension differences and numerical range differences between different features, so that subsequent mathematical operations are more accurate and effective.
In this embodiment, a specific implementation manner for normalizing the TF-IDF vector is provided:
Calculating the mean value and standard deviation of each TF-IDF vector;
For each TF-IDF vector in the TF-IDF vector matrix, performing a normalization process by:
y = (a - m) / s;
where a is the original TF-IDF value, m is the mean value of the corresponding TF-IDF features, and s is the standard deviation of the corresponding TF-IDF features.
S402, performing transposition operation on the standardized TF-IDF matrix, so that each row represents a TF-IDF characteristic, and each column represents a document;
The normalized TF-IDF matrix is transposed such that each row of the matrix represents a TF-IDF feature and each column represents a document.
This step is to prepare for the subsequent covariance matrix calculation. In the present embodiment, a covariance matrix is used to describe the relationship between features.
S403, performing product operation on the transposed TF-IDF matrix and the original TF-IDF matrix to obtain an intermediate matrix;
Matrix multiplication is performed between the transposed matrix (features as rows, documents as columns) and the original matrix (documents as rows, features as columns).
S404, dividing each element in the intermediate matrix by the total document number contained in the text data set to obtain a covariance matrix.
Dividing each element in the intermediate matrix obtained in the last step by the total document number in the text data set to obtain a covariance matrix. This step is to obtain a normalized covariance matrix, ensuring that the values in the matrix reflect the correlation between features, rather than being affected by the amount of data.
S405, carrying out eigenvalue decomposition on the covariance matrix to obtain eigenvalues and corresponding eigenvectors, wherein the eigenvalues represent variances of the data in the direction of the eigenvectors, and the eigenvectors represent importance of the data in the direction of the eigenvectors;
Eigenvalue decomposition is performed on the covariance matrix to obtain the eigenvalues and corresponding eigenvectors.
The eigenvalues and eigenvectors describe the principal direction and magnitude of change of the data. In the present embodiment, the eigenvalue represents the variance of the data in the direction of the eigenvector, that is, the degree of dispersion of the data in that direction; the feature vector represents the importance of the data in the direction of the feature vector.
S406, determining the number of projection features to be reserved according to a preset dimension reduction target, and determining corresponding feature vectors as projection features;
The number of projection features (i.e., principal components) to be retained is determined according to a preset dimension-reduction target (e.g., the percentage of the original data variance to retain), and the corresponding feature vectors are selected. This step achieves dimension reduction of the data: by retaining the most important principal components (i.e., feature vectors), the dimensionality of the data can be reduced while its principal information is preserved.
S407, calculating projection values of all TF-IDF vectors in the directions corresponding to the projection features based on the projection features to form a projection value sequence;
Using the selected feature vectors as projection matrices, projection values of the original TF-IDF vectors in the direction of these feature vectors are calculated, forming a sequence of projection values. This step is to project the original high-dimensional data into a low-dimensional space while preserving the main direction of change and information of the data. The projected data can be used for subsequent clustering operations, and the computational efficiency and the storage efficiency can be improved due to the reduction of the dimension.
S408, determining the median of the projection value sequence, and determining the median as the first segmentation value.
First, a sequence of projection values is ordered. Then, the median of this ordered sequence is found. If the sequence length is odd, the median is the number in the middle; if the sequence length is even, the median may be the average of the middle two numbers. Finally, this median is determined as the first segmentation value.
This step is to select an appropriate segmentation value to divide the data during the subsequent ball tree construction process. Using the median as the segmentation value is an effective method because it can ensure that the two sets of data after segmentation are approximately equal in number, thereby maintaining the balance of the ball tree. This balance is critical to improving the query efficiency of the ball tree. By setting the median to the first division value, the construction of the ball tree can be made more reasonable and efficient.
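A minimal numpy sketch of steps S401 to S408 follows; the random matrix stands in for the TF-IDF vector matrix, and retaining a single principal component is an illustrative choice:
import numpy as np

X = np.random.default_rng(0).random((100, 20))  # stand-in TF-IDF vector matrix

# S401: standardize each feature, y = (a - m) / s (guarding against zero variance)
std = X.std(axis=0)
std[std == 0] = 1.0
Z = (X - X.mean(axis=0)) / std

# S402-S404: covariance matrix = Z^T Z divided by the total number of documents
cov = Z.T @ Z / Z.shape[0]

# S405: eigenvalue decomposition (eigh suits the symmetric covariance matrix)
eigvals, eigvecs = np.linalg.eigh(cov)

# S406: keep the principal component with the largest eigenvalue
principal = eigvecs[:, np.argmax(eigvals)]

# S407: projection values of every document in that direction
projections = Z @ principal

# S408: the median of the projection values is the first segmentation value
first_segmentation_value = np.median(projections)
print(first_segmentation_value)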
In this embodiment, by dimension reduction of the data, the dimension of the data can be effectively reduced, and main information in the data is retained. This not only reduces the cost of storage and computation, but also helps to increase the efficiency of subsequent machine learning tasks. The method of the embodiment can extract the main change direction, namely the main component, in the data, so that redundant information and noise in the original data are removed to a certain extent, and the data representation is more compact and effective. By constructing the ball tree index space, the method can quickly locate near areas, and greatly improves the efficiency of the near search. This is particularly important when processing large-scale text data sets.
In the construction process of the ball tree, the local structure information of the data can be effectively maintained by recursively dividing the data into smaller subsets. Meanwhile, the balance of the ball tree can be ensured by using the median as the segmentation value, so that the query performance is further improved. The data after dimension reduction is easier to visually display, and the distribution and the characteristics of the data are facilitated to be analyzed and understood.
In step S208 of the foregoing embodiment, after the ordered output list is output, cluster extraction is performed on the ordered list, so as to obtain a cluster set formed by a plurality of clusters. An embodiment of cluster extraction is provided in the present application, referring to fig. 5, and includes:
S501, drawing a scatter diagram based on the ordered output list, wherein the scatter diagram comprises an x-axis and a y-axis, the x-axis represents the indexes of data points in the ball tree index space, and the y-axis represents the reachable distances of the data points;
This step is to visualize the data points in the ordered output list. The x-axis of the scatter plot represents the index of the data points in the ball tree index space, which may be a sequential number, reflecting the position or order of the data points in the list. The y-axis represents the achievable distance of data points, which is an indicator of the similarity or distance between data points. Points with smaller reach are generally more similar.
S502, identifying and marking valley places on the scatter diagram;
In this step, the scatter plot needs to be examined, in particular the variation along the y-axis (the reachable distance). A valley is a region where the reachable distances are relatively small and the data points are densely concentrated, appearing as a depression on the scatter plot. These valleys generally correspond to clusters in the data, because the data points inside a cluster are relatively compact and their reachable distances are small.
In practice, there are many ways to identify the valley, for example, the first or second derivative of the reachable distance on the scatter plot may be calculated to identify the valley.
S503, determining a cluster segmentation threshold according to the recognized valley, wherein the cluster segmentation threshold is larger than a y value corresponding to the valley;
After identifying the valleys on the scatter plot, a cluster segmentation threshold needs to be set to divide the different clusters. This threshold should be greater than the valley corresponding y value (reachable distance) to ensure that different clusters can be separated. The choice of the threshold will directly affect the accuracy of the clustering result.
The approximate location of the valley is marked on the scatter plot. For each valley, the average or median of the reachable distances of the data points within the region is calculated, which can be taken as the representative y-value of the valley.
A value slightly larger than the representative y value of the valley is selected as the cluster segmentation threshold. This value should be large enough to ensure that different clusters can be separated, but not so large as to split data points that belong to the same cluster. The optimal cluster segmentation threshold can be found through repeated attempts and adjustments: initially, a relatively conservative threshold may be set and then adjusted based on the clustering result.
And clustering the data by using the set cluster segmentation threshold. Checking the clustering result ensures that the data points in the same cluster have similar characteristics, and the data points among different clusters have larger differences.
If the clustering result is found to be undesirable, if some data points that should belong to the same cluster are incorrectly segmented into different clusters, or if data points in different clusters are incorrectly classified into the same cluster, then the cluster segmentation threshold needs to be adjusted.
S504, traversing the output list graph according to the cluster segmentation threshold, and distributing continuous data points to a cluster, wherein the reachable distance of the data points in the cluster is smaller than or equal to the corresponding cluster segmentation threshold;
In this step, the entire output list (or scatter plot) is traversed, assigning data points to different clusters according to the comparison of their reachable distance to the cluster segmentation threshold. If the reachable distance of the data point is less than or equal to the cluster segmentation threshold, it is considered part of the current cluster. If the reachable distance of the data point is larger than the cluster segmentation threshold value, the current data point is not in the current cluster, and a new cluster needs to be started.
When a new cluster is found to start (i.e., the reachable distance of the current data point is greater than the cluster segmentation threshold), the cluster that has been currently constructed is added to the cluster list, and the current cluster is emptied, ready to construct a new cluster. The current data point is added to the new current cluster and the traversal of the next data point is continued.
After traversing all the data points in the output list, the last constructed current cluster is added to the cluster list (the cluster list may be constructed in advance).
S505, when the reachable distance of the data point is higher than the current cluster segmentation threshold, starting a new cluster until the output list is traversed, and obtaining a cluster set.
When a data point whose reachable distance is above the current cluster segmentation threshold is encountered during traversal, this marks the end of the current cluster and the start of the next one. The current cluster is closed at this point and a new cluster is started. This process continues until the entire output list has been traversed, finally yielding a cluster set composed of multiple clusters.
The embodiment provides a visual clustering method based on reachable distance and scatter diagram observation. It relies on the identification of valleys in the scatter plot and the rational setting of cluster segmentation thresholds to effectively extract clusters from the ordered output list. This approach may be particularly effective when processing data sets with significant density differences or complex shapes.
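A minimal sketch of the traversal in steps S504 and S505 follows; the reachable distances and the threshold are illustrative values chosen above the valleys:
reachability = [0.2, 0.3, 0.25, 1.5, 0.4, 0.35, 2.0, 0.3]  # illustrative values
threshold = 1.0  # cluster segmentation threshold chosen above the valleys

clusters, current = [], []
for index, distance in enumerate(reachability):
    if distance > threshold:   # the current cluster ends; a new one starts here
        if current:
            clusters.append(current)
        current = [index]
    else:
        current.append(index)
if current:
    clusters.append(current)   # add the last cluster to the cluster list
print(clusters)  # [[0, 1, 2], [3, 4, 5], [6, 7]]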
In the embodiment corresponding to fig. 1, when the association rule of each cluster is mined, the frequent item sets are determined based on the support degree of calculating the item sets of each level, however, when the item sets are large in scale, some frequent item sets may be ignored in this way, and therefore, the application also provides another embodiment, in which a tree data structure is constructed to mine out new frequent item sets, and the embodiment is described in detail below with reference to fig. 6:
S601, acquiring multi-source heterogeneous service data through a pre-configured data interface, wherein the multi-source heterogeneous service data comprises structured service data and unstructured service data;
S602, extracting features from the unstructured service data, storing the extracted features, and storing the extracted features and the structured service data together into a preset data structure to form a text data set;
S603, traversing the text data set, constructing TF-IDF vectors of all documents in the text data set, and forming a TF-IDF vector matrix;
S604, recursively grouping each vector in the TF-IDF vector matrix into a nested ball tree by using a ball tree construction algorithm to obtain a ball tree index space in which the TF-IDF vectors are nested;
S605, based on the ball tree index space, clustering all data points based on density to obtain a plurality of clusters;
S606, allocating a unique identification code for each cluster, and replacing all data points contained in the cluster with the identification codes;
S607, constructing a candidate 1-item set by taking a cluster as an item according to the unique identification code;
S608, calculating the support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and comparing the support degree with a preset first support degree threshold value, wherein the transaction library comprises a plurality of transactions, and the transactions consist of one or more items;
S609, screening out items not smaller than the first support threshold value to obtain a frequent 1-item set, wherein each item in the frequent 1-item set is composed of a cluster;
S610, constructing frequent n-item sets in the manner above until a frequent (n+1)-item set cannot be constructed;
The implementation of steps S601 to S610 in this embodiment is similar to that of steps S101 to S110 in the previous embodiment, and will not be repeated here.
S611, sorting the support degrees in a descending order based on the support degrees of the items in the frequent 1-item set, and generating an item header table, wherein the item header table stores names and the support degrees of the items;
First, the support of each item in the frequent 1-item set is counted, i.e., the frequency with which each item appears in the data set. Then, the items are sorted in descending order of support; the purpose of the sorting is to enable more efficient generation and traversal of the tree when the tree data structure is subsequently built. Finally, the item header table is generated: a data structure that stores the sorted items and their corresponding supports. This table is the basis for the subsequent steps.
S612, constructing a tree data structure of each item based on the item header table;
Using the entries in the header table as nodes, a Tree data structure, such as FP-Tree (frequent pattern Tree), is built layer by layer in descending order of support.
In building the tree, each node holds the name of its item, the support count, and a pointer (if any) to the next node with the same item. The tree structure can effectively compress data and retain association information among item sets.
Specifically, a root node of the tree data structure may be created; starting from the root node, according to the ordering of the items in the item header table, the items of each transaction in the transaction library are grown onto the tree in turn to obtain the tree data structure.
The following is a specific example to illustrate how to construct a Tree data structure (e.g., FP-Tree) from a header table and a transaction library.
Assume the following transaction library (each transaction represents a shopping list and each commodity represents a cluster):
Transaction 1: bread, milk
Transaction 2: bread, diaper, beer
Transaction 3: milk, diaper, orange juice
Transaction 4: bread, milk, diaper, beer
Transaction 5: diaper, beer
First, the support degree of each item is counted (the manner of calculation of the support degree is already given in the foregoing embodiment):
Bread: 3
Milk: 3
Diaper: 4
Beer: 3
Orange juice: 1
Then, the items are sorted in descending order of support, and the item header table is generated: Diaper: 4; Bread: 3; Milk: 3; Beer: 3; Orange juice: 1.
the following begins to build the FP-Tree:
First, a root node is created, which may be named "root"; it does not contain any item and serves only as the starting point of the tree.
From the root node, the items in each transaction in the transaction library are added to the tree in the order given by the item header table.
For transaction 1 (bread, milk), starting from the root node, add "bread" as a child node (if not already present) and increment its count; then add the "milk" node under the "bread" node and increment its count.
For transaction 2 (bread, diaper, beer), starting from the root node, add the "bread", "diaper" and "beer" nodes in turn (if not already present), incrementing the count on each node.
Transactions 3, 4 and 5 are processed in the same way.
When adding a node, if the item already exists on the current path of the tree, only the count of that node is incremented and no new node is added.
Referring to fig. 7, the final constructed FP-Tree may be as shown there (specific count values are omitted for simplicity):
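A minimal Python sketch of this insertion logic, continuing the build_header_table sketch above (the FPNode class, build_fp_tree function, and within-transaction ordering by the item header table are illustrative, not part of the claimed method):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}        # item -> FPNode
        self.node_link = None     # next node in the tree holding the same item

def build_fp_tree(transactions, header):
    root = FPNode(None, None)     # the "root" node holds no item
    order = {item: rank for rank, item in enumerate(header)}
    for tx in transactions:
        # Keep only frequent items, in header-table (descending-support) order
        items = sorted((i for i in set(tx) if i in order), key=order.get)
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                head = header[item][1]       # thread the node-link chain
                if head is None:
                    header[item][1] = child
                else:
                    while head.node_link:
                        head = head.node_link
                    head.node_link = child
            node = node.children[item]
            node.count += 1       # existing path: only the count increases
    return root

transactions = [["bread", "milk"], ["bread", "diaper", "beer"],
                ["milk", "diaper", "orange juice"],
                ["bread", "milk", "diaper", "beer"], ["diaper", "beer"]]
header = build_header_table(transactions, min_support_count=2)
tree = build_fp_tree(transactions, header)
```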
S613, mining a new frequent item set through the tree data structure;
Starting from the root node of the tree data structure, frequent item sets are found by traversing the tree. An FP-Growth algorithm may be used to traverse the tree and generate candidate frequent item sets. During traversal, which item sets are frequent is determined according to a preset minimum support threshold.
Specifically, the item header table further includes a node position pointer of the item in the tree data structure;
traversing the tree data structure for a target item in the item header table based on a node position pointer corresponding to the target item to obtain an item path containing the target item;
The whole item header table is traversed, and all obtained item paths are stored as a new frequent item set.
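Continuing the sketch above, the node-link chain stored in the item header table drives the path collection described in this step; a full FP-Growth implementation would additionally recurse on conditional trees built from these paths, which is omitted here for brevity:

```python
def prefix_paths(item, header):
    """Collect all root-to-node paths for nodes holding `item`, each
    weighted by that node's count (the conditional pattern base)."""
    paths = []
    node = header[item][1]            # head of the node-link chain
    while node is not None:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            paths.append((list(reversed(path)), node.count))
        node = node.node_link
    return paths
```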
S614, combining the new frequent item set obtained by mining with the original frequent item set to obtain a final frequent item set;
The new frequent item set mined through the tree data structure is merged with the previous frequent item set. Duplicate item sets are removed during merging to ensure that the final collection of frequent item sets contains no duplicate elements.
S615, determining the items conforming to the preset confidence coefficient based on all the generated frequent item sets, and outputting the association rule.
For each item set in the merged frequent item sets, its confidence is calculated. Confidence is a conditional probability: for a rule X→Y, it represents the probability that a transaction containing X also contains Y. Association rules whose confidence meets a preset confidence threshold are judged to be strong association rules, and all strong association rules are output. These rules can be used to guide decision making, recommendation systems, market analysis, and many other fields.
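A small illustrative sketch of this confidence calculation (the support values below are derived from the five example transactions earlier in this embodiment and are used only for illustration):

```python
def confidence(support_of, antecedent, consequent):
    """Confidence(X → Y) = support(X ∪ Y) / support(X)."""
    x = frozenset(antecedent)
    return support_of[x | frozenset(consequent)] / support_of[x]

# From the example: support(diaper) = 4/5, support(diaper, beer) = 3/5
support_of = {frozenset({"diaper"}): 0.8,
              frozenset({"diaper", "beer"}): 0.6}
print(confidence(support_of, {"diaper"}, {"beer"}))   # 0.75
```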
The present embodiment mines frequent item sets more efficiently by building a tree data structure, especially when the number of item sets is large. This approach can discover frequent item sets that conventional approaches may miss, and provides more comprehensive association rule analysis results.
In order to evaluate the clustered results in the present application, referring to fig. 8, an embodiment is further provided in the present application, where the embodiment includes:
S801, multi-source heterogeneous service data is obtained through a pre-configured data interface, wherein the multi-source heterogeneous service data comprises structured service data and unstructured service data;
S802, extracting features from the unstructured service data, storing the extracted features, and storing the extracted features and the structured service data together into a preset data structure to form a text data set;
S803, traversing the text data set, constructing TF-IDF vectors of all documents in the text data set, and forming a TF-IDF vector matrix;
S804, recursively grouping each vector in the TF-IDF vector matrix into a nested ball tree by using a ball tree construction algorithm to obtain a ball tree index space in which the TF-IDF vector is nested;
S805, based on the ball tree index space, performing density-based clustering on all data points to obtain a plurality of clustering clusters;
The implementation of steps S801 to S805 in this embodiment is similar to that of steps S101 to S105 in the previous embodiment, and will not be repeated here.
S806, evaluating the clustering effect of each cluster in the cluster set to obtain an evaluation result;
In this step, the cluster set obtained by the clustering algorithm needs to be evaluated to determine the effect and quality of clustering. Evaluating the clusters helps in understanding the compactness and separability of the clustering result and its degree of agreement with the true structure of the data. The clustering effect can be evaluated in the following ways:
Cluster quality can be evaluated based on the similarity between data points in the clustering result, for example with the silhouette coefficient (Silhouette Coefficient), which measures the closeness of data points within the same cluster as well as the separability between different clusters.
If a true class label is available, an external evaluation index, such as an adjusted Rand index (Adjusted Rand Index, ARI), normalized mutual information (Normalized Mutual Information, NMI), etc., may be used to measure the consistency of the clustering result with the true class.
For two-dimensional or three-dimensional data, the clustering effect can also be evaluated visually through methods such as scatter plots and heat maps.
Whether the clustering result meets expectations is also evaluated against specific business targets and domain knowledge, for example whether the identified clusters reflect market segments, customer groups, and the like.
By applying these evaluation methods in combination, a comprehensive evaluation result is obtained, which indicates whether further optimization of the clustering parameters is required.
A specific evaluation method of the evaluation result is provided below:
for each cluster, the intra-class sum of squares ss_w is calculated by:
SS_W = Σ_{i=1}^{k} Σ_{x∈C_i} ||x - μ_i||^2;
wherein SS_W represents the intra-class sum of squares, i indexes the i-th cluster (i = 1, 2, ..., k), x ∈ C_i denotes each element in cluster C_i, and ||x - μ_i||^2 represents the square of the distance from element x to the cluster centroid μ_i;
for each cluster, the sum of squares between classes is calculated by the following equation:
SS_B = Σ_{i=1}^{k} n_i ||x̄_i - x̄||^2;
wherein SS_B represents the between-class sum of squares, i indexes the i-th cluster (i = 1, 2, ..., k), n_i represents the number of elements in cluster C_i, and ||x̄_i - x̄||^2 represents the square of the distance from the cluster centroid x̄_i to the centroid x̄ of the entire cluster set;
The evaluation index was calculated by the following formula:
CH-Index={SS_B/(k-1)}/{SS_W/(n-k)};
Where k represents the number of clusters and n represents the total number of TF-IDF vectors contained in the spherical tree index space.
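A minimal sketch of this evaluation index (the Calinski-Harabasz index), assuming the data points and cluster labels are available as arrays; the function name ch_index is illustrative:

```python
import numpy as np

def ch_index(X, labels):
    """CH-Index = {SS_B / (k - 1)} / {SS_W / (n - k)}; requires 1 < k < n."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n, clusters = len(X), np.unique(labels)
    k = len(clusters)
    overall = X.mean(axis=0)              # centroid of the whole data set
    ss_w = ss_b = 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        ss_w += ((members - centroid) ** 2).sum()                 # SS_W term
        ss_b += len(members) * ((centroid - overall) ** 2).sum()  # SS_B term
    return (ss_b / (k - 1)) / (ss_w / (n - k))
```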
S807, adjusting the value of the neighborhood radius Eps or the minimum point number MinPts according to the evaluation result.
After the evaluation result of the clustering effect is obtained, if the clustering effect is found to be unsatisfactory, for example too many or too few clusters, unclear cluster boundaries, or data points in some clusters that are inconsistent with business logic, two key parameters of the clustering process need to be adjusted: the neighborhood radius Eps (epsilon) and the minimum point number MinPts.
Adjusting the neighborhood radius Eps:
If the evaluation result shows that the clusters are too scattered, the value of Eps can be reduced, making the requirement on neighboring points stricter so that more compact clusters are formed.
If there are too few clusters, or many points are falsely marked as noise points (i.e., points that belong to no cluster), the value of Eps can be increased to relax the requirement on neighboring points so that more points are classified into clusters.
Adjusting the minimum point number MinPts:
If the evaluation result shows a large number of noise points, or the number of clusters is too large, the value of MinPts may be set too high. In this case, the value of MinPts can be reduced so that fewer points suffice to form a cluster.
If clusters are too large, or the algorithm should be more sensitive to noise points, the value of MinPts can be increased so that more points are needed to form a cluster and isolated points are more likely to be identified as noise points.
Based on the evaluation result CH-Index from the aforementioned step S806: if the CH-Index is greater than a preset expected value, the neighborhood radius Eps is decreased or the minimum point number MinPts is increased;
if the CH-Index is smaller than the preset expected value, the neighborhood radius Eps is increased or the minimum point number MinPts is decreased.
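A minimal sketch of one feedback step under this rule; the step sizes and the choice to change only Eps per call (with the MinPts alternative in comments) are assumptions:

```python
def adjust_parameters(ch_index, expected, eps, min_pts,
                      eps_step=0.05, pts_step=1):
    """One feedback step on (Eps, MinPts) driven by the CH-Index."""
    if ch_index > expected:
        return eps - eps_step, min_pts        # or: min_pts + pts_step
    if ch_index < expected:
        return eps + eps_step, min_pts        # or: min_pts - pts_step
    return eps, min_pts
```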
S808, distributing a unique identification code for each cluster, and replacing all data points contained in the cluster with the identification codes;
S809, constructing a candidate 1-item set by taking a cluster as an item according to the unique identification code;
S810, calculating the support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and comparing the support degree with a preset first support degree threshold value, wherein the transaction library comprises a plurality of transactions, and the transactions consist of one or more items;
S811, screening out items not smaller than the first support threshold value to obtain a frequent 1-item set, wherein each item in the frequent 1-item set is composed of a cluster;
S812, constructing frequent n-item sets in the above manner until a frequent (n+1)-item set cannot be constructed;
S813, determining the items conforming to the preset confidence level based on all the generated frequent item sets, and outputting the association rule.
Steps S808 to S813 in this embodiment are similar to steps S606 to S610 and S615 in the foregoing embodiment, and are not repeated here.
The foregoing describes embodiments of the method provided in the present application in detail; embodiments of the apparatus provided in the present application are described below:
referring to fig. 9, the present application provides an embodiment of a multidimensional data processing analysis apparatus, the embodiment comprising:
A data obtaining unit 1001, configured to obtain multi-source heterogeneous service data through a pre-configured data interface, where the multi-source heterogeneous service data includes structured service data and unstructured service data;
The feature extraction unit 1002 is configured to perform feature extraction on the unstructured service data, store the extracted feature data, and store the extracted feature data and the structured service data together in a preset data structure to form a text data set;
A vector construction unit 1003, configured to traverse the text data set, construct TF-IDF vectors of all documents in the text data set, and form a TF-IDF vector matrix;
A ball tree construction unit 1004, configured to recursively group each vector in the TF-IDF vector matrix into nested ball trees using a ball tree construction algorithm, to obtain a ball tree index space in which the TF-IDF vector is nested;
a clustering unit 1005, configured to perform density-based clustering on all data points based on the spherical tree index space, to obtain a plurality of clusters;
An identifier code allocation unit 1006, configured to allocate a unique identifier code to each cluster, and replace all data points included in the cluster with the identifier code;
a candidate item set construction unit 1007, configured to construct a candidate 1-item set by using a cluster as an item according to the unique identification code;
A support degree calculating unit 1008, configured to calculate a support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and compare the support degree with a preset first support degree threshold, where the transaction library includes a plurality of transactions, and the transactions are composed of one or more items;
An item set screening unit 1009, configured to screen out items not less than the first support threshold to obtain a frequent 1-item set, where each item in the frequent 1-item set is composed of a cluster;
A frequent item set construction unit 1010, configured to construct frequent n-item sets in the above manner until a frequent (n+1)-item set cannot be constructed;
An association rule output unit 1011, configured to determine items conforming to a preset confidence level based on all the generated frequent item sets and output association rules.
Optionally, the vector construction unit 1003 is specifically configured to:
for any vocabulary in the text data set, calculating the occurrence frequency of the vocabulary in the text data set to obtain word frequency TF;
Calculating the document proportion of the documents containing the vocabulary in the text data set to the total document number, and calculating the logarithm of the document proportion to obtain an inverse document frequency IDF;
traversing the text data set, and constructing TF-IDF vectors according to the obtained TF and IDF;
and constructing a TF-IDF vector matrix for all documents in the text data set by using the obtained TF-IDF vector.
Optionally, the vector construction unit 1003 is specifically configured to:
Word segmentation processing is carried out on any text data in the text data set;
Calculating word frequencies of various words in the text data by the following formula:
TF(t, D) = C(t, D) / T(D);
Wherein:
TF represents the vocabulary frequency;
t represents a given vocabulary;
D represents a given document;
C(t, D) represents the number of times the vocabulary t appears in the document D;
T(D) represents the total number of words in document D.
Optionally, the vector construction unit 1003 is specifically configured to:
the inverse document frequency in the text data is calculated by:
IDF(t)=log_e(N_doc/df(t));
Wherein:
IDF represents the inverse document frequency;
t represents a given vocabulary;
n_doc represents the total number of documents in the text dataset;
df (t) represents the number of documents containing the vocabulary t;
log_e is a logarithmic function based on a natural constant e.
Optionally, the vector construction unit 1003 is specifically configured to:
The TF-IDF vector is calculated by the following equation:
TF-IDF(t, D) = TF(t, D) * IDF(t).
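As an illustrative sketch combining the TF, IDF and TF-IDF formulas above (the function name tfidf_matrix and the token-list input format are assumptions, not part of the claimed apparatus):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix) with
    matrix[d][t] = TF(t, D) * IDF(t), using the formulas above."""
    vocab = sorted({t for doc in docs for t in doc})
    n_doc = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # df(t)
    rows = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        rows.append([(counts[t] / total) * math.log(n_doc / df[t])
                     if counts[t] else 0.0 for t in vocab])
    return vocab, rows
```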
Optionally, the clustering unit 1005 is specifically configured to:
Determining a preset neighborhood radius Eps and a minimum point number MinPts;
For any data point in the ball tree index space, determining, by using the ball tree index space, the number of data points contained in its Eps neighborhood;
If the number of data points contained in the Eps neighborhood of any data point is not less than MinPts, determining the data point as a core object;
Creating an empty output list for storing the output data points;
Creating an empty priority queue, wherein the priorities of the priority queues are arranged based on reachable distances, the reachable distances represent the distances from the data points to the nearest core object, and if the data points are not processed, the corresponding reachable distances are set to infinity;
Randomly selecting an unprocessed target point p from the ball tree index space, marking it as processed, and adding it to the output list;
The entire ball tree index space is traversed by the following steps until all data points are marked as processed:
Step a: if the target point p is a core object, determining, by using the ball tree index space, the distance from the target point p to each unprocessed data point in its Eps neighborhood, adding these data points to the priority queue, and updating their reachable distances to the nearest core object;
If the data point is already in the priority queue and the distance from the data point to the target point p is smaller than the current reachable distance, updating the distance of the data point in the priority queue;
step b: taking out the data point with the smallest reachable distance from the priority queue, marking the data point as processed, and adding the data point into the output list;
step c: if the extracted data point is a core object, repeating the step a;
Step d: if the priority queue is not empty, repeating the steps b and c;
after the ordered output list is output, cluster extraction is carried out on the ordered list, and a cluster set formed by a plurality of clusters is obtained.
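A minimal sketch of this ordered, reachability-based traversal (an OPTICS-style procedure). A brute-force distance matrix stands in for ball tree queries here, and the function name optics_order is an assumption:

```python
import heapq
import numpy as np

def optics_order(X, eps, min_pts):
    """Ordered output list of (point index, reachable distance), following
    steps a to d above."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    reach = np.full(n, np.inf)            # unprocessed points: infinity
    processed = np.zeros(n, dtype=bool)
    order = []

    def core_dist(i):
        # Core object: at least min_pts points inside its Eps neighborhood
        if len(neighbors[i]) < min_pts:
            return None
        return np.sort(dist[i][neighbors[i]])[min_pts - 1]

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        order.append((start, reach[start]))
        heap, current, cd = [], start, core_dist(start)
        while True:
            if cd is not None:            # step a: expand a core object
                for j in neighbors[current]:
                    if not processed[j]:
                        r = max(cd, dist[current, j])
                        if r < reach[j]:
                            reach[j] = r
                            heapq.heappush(heap, (r, j))
            while heap and processed[heap[0][1]]:
                heapq.heappop(heap)       # drop stale queue entries
            if not heap:                  # step d: stop when the queue empties
                break
            r, current = heapq.heappop(heap)   # step b: smallest reachability
            processed[current] = True
            order.append((current, r))
            cd = core_dist(current)       # step c: expand again if core
    return order
```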
Optionally, the vector construction unit 1003 is specifically configured to:
The TF-IDF values of all words in the document are formed into TF-IDF vectors;
The TF-IDF vectors of all documents are arranged according to rows to form a TF-IDF vector matrix;
The number of rows of the matrix is the number of documents, and the number of columns is the number of words.
Optionally, the ball tree construction unit 1004 is specifically configured to:
selecting a vector from the TF-IDF vector matrix as a root node of a ball tree, and taking the vector as an original ball tree, wherein the original ball tree comprises all vectors in the TF-IDF vector matrix;
Dividing the vectors in the original ball tree into a first group and a second group according to a predetermined first segmentation value, wherein the distance from the vectors in the first group to the root node is smaller than or equal to the first segmentation value, and the distance from the vectors in the second group to the root node is larger than the first segmentation value;
recursively constructing leaf nodes for each set of vectors;
Repeating the steps for each leaf node until a termination condition is met, wherein the termination condition is that the number of vectors contained in each leaf node is smaller than a preset termination threshold value; when the termination condition is satisfied, a ball tree index space is formed.
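A minimal sketch of this recursive construction. For brevity the median distance to the root vector stands in for the projection-based first segmentation value described next; the class and function names are assumptions:

```python
import numpy as np

class BallNode:
    def __init__(self, indices):
        self.indices = indices        # rows of the TF-IDF matrix in this ball
        self.left = self.right = None

def build_ball_tree(X, indices=None, terminate_at=16):
    """Recursive two-way split: vectors whose distance to the root vector is
    <= the segmentation value go to the first group, the rest to the second."""
    X = np.asarray(X, dtype=float)
    if indices is None:
        indices = np.arange(len(X))
    node = BallNode(indices)
    if len(indices) < terminate_at:          # termination condition
        return node
    root_vec = X[indices[0]]                 # vector chosen as the root node
    d = np.linalg.norm(X[indices] - root_vec, axis=1)
    split = np.median(d)                     # stand-in segmentation value
    near, far = indices[d <= split], indices[d > split]
    if len(far) == 0:                        # degenerate split: make a leaf
        return node
    node.left = build_ball_tree(X, near, terminate_at)
    node.right = build_ball_tree(X, far, terminate_at)
    return node
```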
Optionally, the ball tree construction unit 1004 includes a first segmentation value determination module 10041, specifically configured to:
determining projection characteristics, and calculating projection values of all TF-IDF vectors in the directions corresponding to the projection characteristics based on the projection characteristics to form a projection value sequence;
a median of the sequence of projection values is determined and the median is determined as the first segmentation value.
The first segmentation value determination module 10041 is specifically configured to:
carrying out standardization processing on all TF-IDF vectors;
Calculating a covariance matrix of the normalized TF-IDF vector matrix;
performing eigenvalue decomposition on the covariance matrix to obtain eigenvalues and corresponding eigenvectors, wherein the eigenvalues represent variances of the data in the direction of the eigenvectors, and the eigenvectors represent importance of the data in the direction of the eigenvectors;
and determining the number of projection features to be reserved according to a preset dimension reduction target, and determining corresponding feature vectors as projection features.
The first segmentation value determination module 10041 is specifically configured to:
Calculating the mean value and standard deviation of each TF-IDF vector;
For each TF-IDF vector in the TF-IDF vector matrix, performing a normalization process by:
y=(a-m)/s;
where a is the original TF-IDF value, m is the mean value of the corresponding TF-IDF features, and s is the standard deviation of the corresponding TF-IDF features.
The first segmentation value determination module 10041 is specifically configured to:
performing transposition operation on the standardized TF-IDF matrix so that each row represents a TF-IDF characteristic and each column represents a document;
Performing product operation on the transposed TF-IDF matrix and the original TF-IDF matrix to obtain an intermediate matrix;
Dividing each element in the intermediate matrix by the total document number contained in the text data set to obtain a covariance matrix.
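Combining the standardization, covariance, eigendecomposition and median steps above into one illustrative sketch (the function name first_segmentation_value is an assumption; only the dominant eigenvector is kept here, i.e., a dimension reduction target of one projection feature):

```python
import numpy as np

def first_segmentation_value(tfidf):
    """Standardize, form the covariance matrix via the transpose product,
    take the dominant eigenvector as the projection feature, and return the
    median projection value as the first segmentation value."""
    X = np.asarray(tfidf, dtype=float)
    m, s = X.mean(axis=0), X.std(axis=0)
    s[s == 0] = 1.0                      # guard against constant features
    Z = (X - m) / s                      # y = (a - m) / s
    cov = Z.T @ Z / len(Z)               # divide by the total document count
    eigvals, eigvecs = np.linalg.eigh(cov)
    feature = eigvecs[:, -1]             # eigenvector of the largest eigenvalue
    projections = Z @ feature
    return float(np.median(projections)), feature
```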
Optionally, the clustering unit 1005 is specifically configured to:
Drawing a scatter diagram based on the ordered output list, wherein the scatter diagram comprises an x-axis and a y-axis, the x-axis represents the index of a data point in the ball tree index space, and the y-axis represents the reachable distance of the data point;
Identifying and marking valleys on the scatter plot;
determining a cluster segmentation threshold according to the identified valley, wherein the cluster segmentation threshold is larger than a y value corresponding to the valley;
Traversing the output list according to the cluster segmentation threshold, and assigning consecutive data points to a cluster, wherein the reachable distance of the data points in the cluster is smaller than or equal to the corresponding cluster segmentation threshold;
and when the reachable distance of the data point is higher than the current cluster segmentation threshold value, starting a new cluster until the output list is traversed, and obtaining a cluster set.
Optionally, the clustering unit 1005 is specifically configured to:
first or second derivatives of the reachable distances on the scatter plot are calculated to identify valleys.
Optionally, the clustering unit 1005 is specifically configured to:
and calculating the number of data points contained in each cluster, and eliminating the clusters smaller than a preset number threshold.
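A minimal sketch of this extraction over the ordered output list, including the small-cluster elimination just described. The threshold is placed just above the deepest valley for brevity; the first- or second-derivative valley detection mentioned above could locate valleys more carefully, and the margin and min_size parameters are assumptions:

```python
import numpy as np

def extract_clusters(order, threshold_margin=0.1, min_size=2):
    """Split the ordered (index, reachable distance) list into clusters;
    a reachable distance above the threshold starts a new cluster."""
    finite = np.array([r for _, r in order if np.isfinite(r)])
    valley = finite.min() if len(finite) else 0.0
    threshold = valley + threshold_margin   # just above the valley's y value
    clusters, current = [], []
    for idx, r in order:
        if not np.isfinite(r) or r > threshold:
            if current:
                clusters.append(current)
            current = [idx]                 # high reachability: new cluster
        else:
            current.append(idx)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_size]
```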
Optionally, the candidate item set construction unit 1007 is specifically configured to:
calculate the support of each item in the candidate 1-item set in the transaction library as follows:
support(item_i)=count(item_i)/N;
wherein the items contained in the candidate 1-item set are denoted item_i, count(item_i) represents the number of occurrences of the item in all transactions, and N is the total number of transactions; support(item_i) represents the support of item item_i.
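A small illustrative sketch of this formula, reusing the five example transactions from the earlier embodiment:

```python
def support(item, transactions):
    """support(item_i) = count(item_i) / N over the transaction library."""
    return sum(item in tx for tx in transactions) / len(transactions)

transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "orange juice"},
                {"bread", "milk", "diaper", "beer"}, {"diaper", "beer"}]
print(support("diaper", transactions))   # 0.8
```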
Optionally, the apparatus further includes a frequent item set mining unit 1012, specifically configured to:
Based on the support of each item in the frequent 1-item set, sort the items in descending order of support and generate an item header table, wherein the item header table stores the names and supports of the items;
Constructing a tree data structure of each item based on the item header table;
The new frequent item set is mined through the tree data structure;
And merging the new frequent item set obtained by mining with the original frequent item set to obtain a final frequent item set.
Optionally, the frequent item set mining unit 1012 is specifically configured to:
Creating a root node of the tree data structure;
Starting from the root node, according to the ordering of each item in the item header table, sequentially growing the items in each transaction in the transaction library on the root node to obtain a tree data structure.
Optionally, the frequent item set mining unit 1012 is specifically configured to:
the item header table also comprises a node position pointer of the item in the tree data structure;
traversing the tree data structure for a target item in the item header table based on a node position pointer corresponding to the target item to obtain an item path containing the target item;
Traversing the whole item header table, and storing all obtained item paths as a new frequent item set.
Optionally, the association rule output unit 1011 is specifically configured to:
Generating an association rule set based on the frequent item set, wherein the association rule set comprises a plurality of association rules;
The rule confidence of each association rule is calculated as follows:
For an association rule X→Y:
Confidence(X→Y) = support(X, Y) / support(X);
in the association rule X→Y, X is the condition and Y is the result; the rule indicates that when condition X appears, result Y appears as well. support(X, Y) represents the support of the frequent item set (X, Y), and support(X) represents the support of the frequent item set (X);
Based on the confidence coefficient of each association rule, comparing the confidence coefficient with a preset confidence coefficient threshold value, screening out the association rule meeting the confidence coefficient requirement, and outputting, wherein the confidence coefficient requirement comprises that the confidence coefficient is not smaller than the confidence coefficient threshold value.
Optionally, the apparatus further comprises a cluster evaluation unit 1013, specifically configured to:
evaluating the clustering effect of each cluster in the cluster set to obtain an evaluation result;
and adjusting the value of the neighborhood radius Eps or the minimum point number MinPts according to the evaluation result.
Optionally, the cluster evaluation unit 1013 is specifically configured to:
for each cluster, the intra-class sum of squares ss_w is calculated by:
SS_W = Σ_{i=1}^{k} Σ_{x∈C_i} ||x - μ_i||^2;
wherein SS_W represents the intra-class sum of squares, i indexes the i-th cluster (i = 1, 2, ..., k), x ∈ C_i denotes each element in cluster C_i, and ||x - μ_i||^2 represents the square of the distance from element x to the cluster centroid μ_i;
for each cluster, the sum of squares between classes is calculated by the following equation:
SS_B = Σ_{i=1}^{k} n_i ||x̄_i - x̄||^2;
wherein SS_B represents the between-class sum of squares, i indexes the i-th cluster (i = 1, 2, ..., k), n_i represents the number of elements in cluster C_i, and ||x̄_i - x̄||^2 represents the square of the distance from the cluster centroid x̄_i to the centroid x̄ of the entire cluster set;
The evaluation index was calculated by the following formula:
CH-Index={SS_B/(k-1)}/{SS_W/(n-k)};
Where k represents the number of clusters and n represents the total number of TF-IDF vectors contained in the spherical tree index space.
Optionally, the cluster evaluation unit 1013 is specifically configured to:
If the CH-Index is greater than a preset expected value, decreasing the neighborhood radius Eps or increasing the minimum point number MinPts;
and if the CH-Index is smaller than the preset expected value, increasing the neighborhood radius Eps or decreasing the minimum point number MinPts.
Optionally, the identifier code allocation unit 1006 is specifically configured to:
Creating an integer variable for generating a unique cluster identification code;
For each cluster, assigning a new unique identification code using the integer variable;
traversing each data point in all clusters;
for each data point, determining the cluster to which the data point belongs, and replacing the value of the data point with the unique identification code of the cluster.
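A minimal sketch of this identifier assignment; the function name assign_cluster_ids and the dictionary return format are assumptions:

```python
def assign_cluster_ids(clusters):
    """Replace every data point with its cluster's unique integer code."""
    next_id = 0                        # the integer variable described above
    point_to_id = {}
    for cluster in clusters:
        for point in cluster:
            point_to_id[point] = next_id   # replace the point's value
        next_id += 1                   # a new unique code per cluster
    return point_to_id
```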
The application also provides a multidimensional data processing and analyzing system, which comprises:
A processor 1101, a memory 1102, an input-output unit 1103, and a bus 1104;
the processor 1101 is connected to the memory 1102, the input/output unit 1103 and the bus 1104;
the memory 1102 holds a program, and the processor 1101 calls the program to execute any of the methods described above.
The application also relates to a computer readable storage medium having a program stored thereon, which when run on a computer causes the computer to perform any of the methods described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, or in whole or in part, in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Claims (27)
1. A method of multidimensional data processing analysis, the method comprising:
Acquiring multi-source heterogeneous service data through a pre-configured data interface, wherein the multi-source heterogeneous service data comprises structured service data and unstructured service data;
Extracting features from the unstructured service data, storing the extracted features, and storing the extracted features and the structured service data together into a preset data structure to form a text data set;
traversing the text data set, constructing TF-IDF vectors of all documents in the text data set, and forming a TF-IDF vector matrix;
Recursively grouping each vector in a TF-IDF vector matrix into a nested ball tree by using a ball tree construction algorithm to obtain a ball tree index space in which the TF-IDF vector is nested;
based on the ball tree index space, clustering all data points based on density to obtain a plurality of clusters;
assigning a unique identification code to each cluster, and replacing all data points contained in the cluster with the identification codes;
According to the unique identification code, constructing a candidate 1-item set by taking a cluster as an item;
calculating the support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and comparing the support degree with a preset first support degree threshold value, wherein the transaction library comprises a plurality of transactions, and the transactions consist of one or more items;
Screening out items not smaller than the first support threshold value to obtain a frequent 1-item set, wherein each item in the frequent 1-item set consists of a cluster;
constructing frequent n-item sets in the above manner until a frequent (n+1)-item set cannot be constructed;
and determining the items conforming to the preset confidence coefficient based on all the generated frequent item sets, and outputting the association rule.
2. The method of multidimensional data processing analysis of claim 1, wherein traversing the text dataset to construct TF-IDF vectors for all documents in the text dataset and forming a TF-IDF vector matrix comprises:
for any vocabulary in the text data set, calculating the occurrence frequency of the vocabulary in the text data set to obtain word frequency TF;
Calculating the document proportion of the documents containing the vocabulary in the text data set to the total document number, and calculating the logarithm of the document proportion to obtain an inverse document frequency IDF;
traversing the text data set, and constructing TF-IDF vectors according to the obtained TF and IDF;
and constructing a TF-IDF vector matrix for all documents in the text data set by using the obtained TF-IDF vector.
3. The method of multidimensional data processing analysis of claim 2, wherein, for any vocabulary in the text data set, calculating the occurrence frequency of the vocabulary in the text data set to obtain the word frequency TF comprises:
Word segmentation processing is carried out on any text data in the text data set;
Calculating word frequencies of various words in the text data by the following formula:
TF(t,D)=C(t,D)/T(D);
Wherein:
TF represents the vocabulary frequency;
t represents a given vocabulary;
D represents a given document;
C(t, D) represents the number of times the vocabulary t appears in the document D;
T(D) represents the total number of words in document D.
4. The method of multidimensional data processing and analysis according to claim 2, wherein calculating the proportion of documents in the text data set that contain the vocabulary to the total number of documents, and taking the logarithm of the document proportion to obtain the inverse document frequency IDF, comprises:
the inverse document frequency in the text data is calculated by:
IDF(t)=log_e(N_doc/df(t));
Wherein:
IDF represents the inverse document frequency;
t represents a given vocabulary;
n_doc represents the total number of documents in the text dataset;
df (t) represents the number of documents containing the vocabulary t;
log_e is a logarithmic function based on a natural constant e.
5. The multi-dimensional data processing analysis method of claim 2, wherein constructing TF-IDF vectors from the obtained TF and IDF comprises:
The TF-IDF vector is calculated by the following equation:
TF-IDF(t, D) = TF(t, D) * IDF(t).
6. the multi-dimensional data processing analysis method of claim 1, wherein performing density-based clustering on all data points based on the spherical tree index space to obtain a plurality of clusters comprises:
Determining a preset neighborhood radius Eps and a minimum point number MinPts;
For any data point in the ball tree index space, determining, by using the ball tree index space, the number of data points contained in its Eps neighborhood;
If the number of data points contained in the Eps neighborhood of any data point is not less than MinPts, determining the data point as a core object;
Creating an empty output list for storing the output data points;
Creating an empty priority queue, wherein the priorities of the priority queues are arranged based on reachable distances, the reachable distances represent the distances from the data points to the nearest core object, and if the data points are not processed, the corresponding reachable distances are set to infinity;
Randomly selecting an unprocessed target point p from the ball tree index space, marking it as processed, and adding it to the output list;
The entire ball tree index space is traversed by the following steps until all data points are marked as processed:
Step a: if the target point p is a core object, determining, by using the ball tree index space, the distance from the target point p to each unprocessed data point in its Eps neighborhood, adding these data points to the priority queue, and updating their reachable distances to the nearest core object;
If the data point is already in the priority queue and the distance from the data point to the target point p is smaller than the current reachable distance, updating the distance of the data point in the priority queue;
step b: taking out the data point with the smallest reachable distance from the priority queue, marking the data point as processed, and adding the data point into the output list;
step c: if the extracted data point is a core object, repeating the step a;
Step d: if the priority queue is not empty, repeating the steps b and c;
after the ordered output list is output, cluster extraction is carried out on the ordered list, and a cluster set formed by a plurality of clusters is obtained.
7. The multi-dimensional data processing analysis method of claim 1, wherein traversing the text data set to construct TF-IDF vectors of all documents in the text data set and forming a TF-IDF vector matrix comprises:
The TF-IDF values of all words in the document are formed into TF-IDF vectors;
The TF-IDF vectors of all documents are arranged according to rows to form a TF-IDF vector matrix;
The number of rows of the matrix is the number of documents, and the number of columns is the number of words.
8. The method of multidimensional data processing analysis of claim 1, wherein recursively grouping individual vectors in a TF-IDF vector matrix into nested ball trees using a ball tree construction algorithm to obtain a ball tree index space in which the TF-IDF vectors are nested comprises:
selecting a vector from the TF-IDF vector matrix as a root node of a ball tree, and taking the vector as an original ball tree, wherein the original ball tree comprises all vectors in the TF-IDF vector matrix;
Dividing the vectors in the original ball tree into a first group and a second group according to a predetermined first segmentation value, wherein the distance from the vectors in the first group to the root node is smaller than or equal to the first segmentation value, and the distance from the vectors in the second group to the root node is larger than the first segmentation value;
recursively constructing leaf nodes for each set of vectors;
Repeating the steps for each leaf node until a termination condition is met, wherein the termination condition is that the number of vectors contained in each leaf node is smaller than a preset termination threshold value; when the termination condition is satisfied, a ball tree index space is formed.
9. The multi-dimensional data processing analysis method of claim 8, wherein the first segmentation value is determined by:
determining projection characteristics, and calculating projection values of all TF-IDF vectors in the directions corresponding to the projection characteristics based on the projection characteristics to form a projection value sequence;
a median of the sequence of projection values is determined and the median is determined as the first segmentation value.
10. The multi-dimensional data processing analysis method of claim 9, wherein the determining projection characteristics comprises:
carrying out standardization processing on all TF-IDF vectors;
Calculating a covariance matrix of the normalized TF-IDF vector matrix;
performing eigenvalue decomposition on the covariance matrix to obtain eigenvalues and corresponding eigenvectors, wherein the eigenvalues represent variances of the data in the direction of the eigenvectors, and the eigenvectors represent importance of the data in the direction of the eigenvectors;
and determining the number of projection features to be reserved according to a preset dimension reduction target, and determining corresponding feature vectors as projection features.
11. The multi-dimensional data processing analysis method of claim 10, wherein normalizing all TF-IDF vectors comprises:
Calculating the mean value and standard deviation of each TF-IDF vector;
For each TF-IDF vector in the TF-IDF vector matrix, performing a normalization process by:
y=(a-m)/s;
where a is the original TF-IDF value, m is the mean value of the corresponding TF-IDF features, and s is the standard deviation of the corresponding TF-IDF features.
12. The multi-dimensional data processing analysis method according to claim 10, wherein calculating the covariance matrix of the normalized TF-IDF vector matrix comprises:
performing transposition operation on the standardized TF-IDF matrix so that each row represents a TF-IDF characteristic and each column represents a document;
Performing product operation on the transposed TF-IDF matrix and the original TF-IDF matrix to obtain an intermediate matrix;
Dividing each element in the intermediate matrix by the total document number contained in the text data set to obtain a covariance matrix.
13. The method of multidimensional data processing analysis according to claim 6, wherein outputting the ordered output list and then performing cluster extraction on the ordered list to obtain a cluster set comprising a plurality of clusters comprises:
Drawing a scatter diagram based on the ordered output list, wherein the scatter diagram comprises an x-axis and a y-axis, the x-axis represents the index of a data point in the ball tree index space, and the y-axis represents the reachable distance of the data point;
Identifying and marking valleys on the scatter plot;
determining a cluster segmentation threshold according to the identified valley, wherein the cluster segmentation threshold is larger than a y value corresponding to the valley;
Traversing the output list according to the cluster segmentation threshold, and assigning consecutive data points to a cluster, wherein the reachable distance of the data points in the cluster is smaller than or equal to the corresponding cluster segmentation threshold;
and when the reachable distance of the data point is higher than the current cluster segmentation threshold value, starting a new cluster until the output list is traversed, and obtaining a cluster set.
14. The method of multidimensional data processing analysis of claim 13, wherein the identifying and marking valleys on the scatter plot comprises:
first or second derivatives of the reachable distances on the scatter plot are calculated to identify valleys.
15. The method of multidimensional data processing analysis of claim 13, wherein after traversing the output list and obtaining a plurality of clusters, the method further comprises:
and calculating the number of data points contained in each cluster, and eliminating the clusters smaller than a preset number threshold.
16. The multi-dimensional data processing analysis method of claim 1, wherein calculating the support of each item in the candidate 1-item set in a pre-constructed transaction library comprises:
calculating the support of each item in the candidate 1-item set in the transaction library as follows:
support(item_i)=count(item_i)/N;
wherein the items contained in the candidate 1-item set are denoted item_i, count(item_i) represents the number of occurrences of the item in all transactions, and N is the total number of transactions; support(item_i) represents the support of item item_i.
17. The multi-dimensional data processing analysis method of claim 1, further comprising, after constructing the frequent n-item set:
based on the support of each item in the frequent 1-item set, sorting the items in descending order of support and generating an item header table, wherein the item header table stores the names and supports of the items;
Constructing a tree data structure of each item based on the item header table;
The new frequent item set is mined through the tree data structure;
And merging the new frequent item set obtained by mining with the original frequent item set to obtain a final frequent item set.
18. The method of multidimensional data processing analysis of claim 17, wherein constructing a tree data structure of each item based on the item header table comprises:
Creating a root node of the tree data structure;
Starting from the root node, according to the ordering of each item in the item header table, sequentially growing the items in each transaction in the transaction library on the root node to obtain a tree data structure.
19. The multi-dimensional data processing analysis method of claim 18, wherein mining out new frequent item sets from the tree data structure comprises:
the item header table also comprises a node position pointer of the item in the tree data structure;
traversing the tree data structure for a target item in the item header table based on a node position pointer corresponding to the target item to obtain an item path containing the target item;
Traversing the whole item header table, and storing all obtained item paths as a new frequent item set.
20. The method according to claim 1, wherein determining items meeting a preset confidence level based on all the generated frequent item sets, and outputting the association rule comprises:
Generating an association rule set based on the frequent item set, wherein the association rule set comprises a plurality of association rules;
The rule confidence of each association rule is calculated as follows:
For an association rule X→Y:
Confidence(X→Y) = support(X, Y) / support(X);
in the association rule X→Y, X is the condition and Y is the result; the rule indicates that when condition X appears, result Y appears as well. support(X, Y) represents the support of the frequent item set (X, Y), and support(X) represents the support of the frequent item set (X);
Based on the confidence coefficient of each association rule, comparing the confidence coefficient with a preset confidence coefficient threshold value, screening out the association rule meeting the confidence coefficient requirement, and outputting, wherein the confidence coefficient requirement comprises that the confidence coefficient is not smaller than the confidence coefficient threshold value.
21. The method of multi-dimensional data processing analysis according to claim 6, wherein after obtaining the cluster set of the plurality of clusters, the method further comprises:
evaluating the clustering effect of each cluster in the cluster set to obtain an evaluation result;
and adjusting the value of the neighborhood radius Eps or the minimum point number MinPts according to the evaluation result.
22. The method of multidimensional data processing and analysis according to claim 21, wherein evaluating the clustering effect of each cluster in the cluster set includes:
for each cluster, the intra-class sum of squares ss_w is calculated by:
SS_W = Σ_{i=1}^{k} Σ_{x∈C_i} ||x - μ_i||^2;
wherein SS_W represents the intra-class sum of squares, i indexes the i-th cluster (i = 1, 2, ..., k), x ∈ C_i denotes each element in cluster C_i, and ||x - μ_i||^2 represents the square of the distance from element x to the cluster centroid μ_i;
for each cluster, the sum of squares between classes is calculated by the following equation:
SS_B = Σ_{i=1}^{k} n_i ||x̄_i - x̄||^2;
wherein SS_B represents the between-class sum of squares, i indexes the i-th cluster (i = 1, 2, ..., k), n_i represents the number of elements in cluster C_i, and ||x̄_i - x̄||^2 represents the square of the distance from the cluster centroid x̄_i to the centroid x̄ of the entire cluster set;
The evaluation index was calculated by the following formula:
CH-Index={SS_B/(k-1)}/{SS_W/(n-k)};
Where k represents the number of clusters and n represents the total number of TF-IDF vectors contained in the spherical tree index space.
23. The multi-dimensional data processing analysis method of claim 22, wherein adjusting the value of the neighborhood radius Eps or the minimum point number MinPts according to the evaluation result comprises:
if the CH-Index is greater than a preset expected value, decreasing the neighborhood radius Eps or increasing the minimum point number MinPts;
and if the CH-Index is smaller than the preset expected value, increasing the neighborhood radius Eps or decreasing the minimum point number MinPts.
24. The method of claim 1, wherein assigning a unique identification code to each cluster and replacing all data points contained in the cluster with the identification code comprises:
Creating an integer variable for generating a unique cluster identification code;
For each cluster, assigning a new unique identification code using the integer variable;
traversing each data point in all clusters;
for each data point, determining the cluster to which the data point belongs, and replacing the value of the data point with the unique identification code of the cluster.
25. A multi-dimensional data processing analysis device, the device comprising:
the system comprises a data acquisition unit, a data processing unit and a data processing unit, wherein the data acquisition unit is used for acquiring multi-source heterogeneous service data through a pre-configured data interface, and the multi-source heterogeneous service data comprises structured service data and unstructured service data;
The feature extraction unit is used for extracting features from the unstructured service data, storing the extracted features, and storing the extracted features and the structured service data into a preset data structure together to form a text data set;
The vector construction unit is used for traversing the text data set, constructing TF-IDF vectors of all documents in the text data set and forming a TF-IDF vector matrix;
A ball tree construction unit, configured to recursively group each vector in a TF-IDF vector matrix into nested ball trees using a ball tree construction algorithm, to obtain a ball tree index space in which the TF-IDF vector is nested;
The clustering unit is used for performing density-based clustering on all data points based on the ball tree index space to obtain a plurality of clustering clusters;
the identification code distribution unit is used for distributing a unique identification code for each cluster and replacing all data points contained in the cluster with the identification codes;
a candidate item set construction unit, configured to construct a candidate 1-item set by using a cluster as an item according to the unique identification code;
The support degree calculation unit is used for calculating the support degree of each item in the candidate 1-item set in a pre-constructed transaction library, and comparing the support degree with a preset first support degree threshold value, wherein the transaction library comprises a plurality of transactions, and the transactions consist of one or more items;
The item set screening unit is used for screening out items which are not smaller than the first support threshold value to obtain frequent 1-item sets, wherein each item in the frequent 1-item sets consists of a cluster;
a frequent item set construction unit, configured to construct frequent n-item sets in the above manner until a frequent (n+1)-item set cannot be constructed;
and the association rule output unit is used for determining the items conforming to the preset confidence coefficient based on all the generated frequent item sets and outputting association rules.
26. A multi-dimensional data processing analysis system, comprising:
a processor, a memory, an input-output unit, and a bus;
The processor is connected with the memory, the input/output unit and the bus;
the memory holds a program which the processor invokes to perform the method of any one of claims 1 to 24.
27. A computer readable storage medium having a program stored thereon, which when executed on a computer performs the method of any of claims 1 to 24.