CN107609105B - Construction method of big data acceleration structure - Google Patents
- Publication number
- CN107609105B CN107609105B CN201710817537.0A CN201710817537A CN107609105B CN 107609105 B CN107609105 B CN 107609105B CN 201710817537 A CN201710817537 A CN 201710817537A CN 107609105 B CN107609105 B CN 107609105B
- Authority
- CN
- China
- Prior art keywords
- data
- attribute
- index
- records
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for constructing a big data acceleration structure, comprising the following steps: A. data preprocessing: the raw data are cleaned, integrated and converted to form a data set suitable for subsequent operations; B. clustering: the similarity between records within each category is computed and, according to the grouping result of the clustering algorithm, the records are reordered so that the most similar records in a group have the minimum spatial distance; C. inverted-index mapping: a mapping relation among transaction attributes, transaction attribute weights and transaction records is established through a three-level index, and this process is repeated until all data are mapped; D. run-length compression: a compression index structure, the transaction attribute weight index and the transaction attributes are initialized, the range of consecutive records sharing an attribute weight is determined, the inverted-index mapping structure is traversed, and the consecutive records under a shared transaction attribute weight are compressed with a run-length encoding algorithm. The method can quickly build an acceleration structure for big data association analysis and significantly increases both the processing speed and the data loading speed of the model.
Description
Technical Field
The invention relates to a method for accelerating data processing, in particular to a method for constructing a big data acceleration structure.
Background
Big data technology has become the most efficient and widely used approach to processing massive amounts of data. Police big data is one of the most representative big data processing scenarios and is receiving increasing attention. When analyzing massive data, the processing speed and performance of a general-purpose big data platform are among the problems that urgently need to be solved. In big data analysis, and in police big data in particular, the most common analysis mode is association analysis: correlating the factors related to an analysis object can genuinely improve the accuracy of police analysis. Current big data processing systems adopt different algorithm models for different association analysis services, and the main way to improve model processing speed today is to parallelize the algorithm for a given service platform. However, this approach works only for a single algorithm on a designated platform, so it is highly restrictive and its effect is not ideal. It mainly divides the whole data set for the platform's specific parallelization framework so that the partitions are evenly distributed across the cluster, processes each local data set with an association rule algorithm model, and collects the results of all nodes through an aggregation function, thereby completing the parallelized execution of the algorithm under that framework. Its drawbacks are obvious: an improvement targeted at one platform cannot achieve good performance on other platforms, and in a complex police platform in particular, association analysis algorithm models are diverse and some algorithms are unsuitable for parallelization, which further limits the method.
Disclosure of Invention
The invention provides a method for constructing a big data acceleration structure, which aims to improve the speed and accuracy of data processing during association analysis of big data.
The invention discloses a method for constructing a big data acceleration structure, which comprises the following steps:
A. Data preprocessing: performing data cleaning, data integration and data conversion on the raw data to form a data set suitable for subsequent operations;
B. Clustering: clustering the preprocessed data; after clustering, calculating the similarity between records within each category through the Hamming distance and reordering the records, so that according to the grouping result of the clustering algorithm the most similar records in a group have the minimum spatial distance;
C. Inverted-index mapping structure: initializing the index file, the attribute weight index entries and the attribute index entries; extracting the attributes and the attribute weight list of the sorted data; then constructing an inverted index structure, establishing a mapping relation among transaction attributes, transaction attribute weights and transaction records according to the three-level index, and repeating this process until all data are mapped. Here a "transaction" refers to an atomic operation in data processing, whose properties are similar to those of a database transaction;
The usual way to build an index is by the number or line number of a data record; the access order is index -> record number -> attribute -> weight, which is called a forward index. An inverted index is built not from the line number of a data record but from the attributes and weights of the records: the access order is index -> attribute -> weight -> record number. Because this access order is the reverse of the forward index, the structure is called an inverted index.
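To make the forward/inverted distinction concrete, the sketch below builds both structures from toy records; the record numbers, attribute names and weights are made-up illustrations, not data from the patent.

```python
# Toy records: record number -> {attribute: weight}. Names and weights are
# illustrative assumptions, not values from the patent.
records = {
    0: {"name": 0.9, "city": 0.5},
    1: {"name": 0.9, "age": 0.3},
}

# Forward index: record number -> attribute -> weight (access order:
# index -> record number -> attribute -> weight).
forward = records

# Inverted index: attribute -> weight -> record numbers (access order:
# index -> attribute -> weight -> record number).
inverted = {}
for rec_no, attrs in records.items():
    for attr, weight in attrs.items():
        inverted.setdefault(attr, {}).setdefault(weight, []).append(rec_no)

print(inverted["name"][0.9])  # [0, 1]: both records share "name" with weight 0.9
```

A lookup for all records whose attribute `name` carries weight 0.9 then follows one dictionary chain instead of scanning every record, which is the point of inverting the access order.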
D. Run-length compression: initializing the compression index structure, the transaction attribute weight index and the transaction attributes; determining the range of consecutive records that share an attribute weight; traversing the inverted-index mapping structure; and compressing the consecutive records under a shared transaction attribute weight with a run-length compression algorithm. The run-length compression algorithm, i.e. the RLE (run-length encoding) algorithm, replaces a run of adjacent identical values in a data record row by two values (for example a run length and the repeated value, or a starting position and a run length); for instance, aaabccccccddeee can be encoded as 3a1b6c2d3e. RLE compresses such data effectively, and many specific run-length compression methods are derived from the RLE principle.
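The character example can be reproduced with a minimal RLE encoder; this is a generic sketch of the RLE principle, not the patent's exact encoder.

```python
def rle_encode(s):
    """Run-length encode a string: replace each run of identical symbols
    with its length followed by the symbol."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend the current run
        out.append(f"{j - i}{s[i]}")    # emit (run length, symbol)
        i = j
    return "".join(out)

print(rle_encode("aaabccccccddeee"))  # 3a1b6c2d3e
```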
Further, the data cleaning in step A includes deleting raw records, filling in missing values, and smoothing/filtering noisy data.
Specifically, whether a record with missing data is deleted or its missing values are filled in is judged by the attribute weight coverage rate p of the raw record, where the attribute weight coverage rate p is:

p = Σ_{A_i present} ω_i / (ω_1 + ω_2 + ... + ω_k)

where A_i is a data attribute, ω_i is its attribute weight, representing the importance of the attribute, and k is the number of attributes. If p ≥ Ω, the missing values in the record are filled in; otherwise the raw record is deleted;
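A sketch of the coverage-rate rule (fill when p ≥ Ω, delete otherwise), assuming p is the weight mass of the attributes that are present divided by the total weight mass; the weight values, field names and threshold below are illustrative assumptions.

```python
def attribute_weight_coverage(record, weights):
    """Assumed form of p: sum of weights of the attributes present in the
    record divided by the sum of all attribute weights."""
    total = sum(weights.values())
    present = sum(w for a, w in weights.items() if record.get(a) is not None)
    return present / total

# Illustrative weights and record (not from the patent).
weights = {"name": 0.5, "age": 0.3, "city": 0.2}
record = {"name": "Li", "age": None, "city": "Chengdu"}
p = attribute_weight_coverage(record, weights)
print(p >= 0.6)  # True: coverage is about 0.7, so the record is kept and filled
```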
The smoothing/filtering of noisy data uses a mean filtering method: for the current record to be processed a template is selected, the template being a number of records adjacent to the current record, and the original record values are replaced by the template mean, expressed per attribute as:

g(·) = (1/M) · Σ_{f ∈ s} f

where k represents the number of attributes (the mean is computed for each of them), g() represents the mean, M is the number of neighbours of the current record, s represents the neighbour set of the current record, and f represents one of the records in s.
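A minimal sketch of the mean filter, assuming the template is a window of records adjacent by position and the mean is taken per attribute; the window choice is an assumption, since the text does not fix how neighbours are selected.

```python
def mean_filter(data, idx, m):
    """Replace record `idx` by the per-attribute mean of a template of up to
    m positionally adjacent records (including the record itself)."""
    lo = max(0, idx - m // 2)
    hi = min(len(data), lo + m)
    template = data[lo:hi]            # the neighbour set s, with M = len(template)
    k = len(data[0])                  # number of attributes
    return [sum(rec[j] for rec in template) / len(template) for j in range(k)]

rows = [[1.0, 2.0], [3.0, 4.0], [100.0, 200.0], [5.0, 6.0], [7.0, 8.0]]
print(mean_filter(rows, 2, 3))  # [36.0, 70.0]: the noisy row is smoothed
```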
Specifically, the method for filling in a missing value is: measure the distance to the known complete raw records by the Euclidean distance, find the complete record most similar to the record with the missing value, and fill in the missing value according to the weights of the relevant fields of that complete record.
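The nearest-complete-record imputation can be sketched as below; numeric fields and a plain (unweighted) Euclidean distance over the present fields are assumptions, since the text names only the distance measure, not the field types.

```python
import math

def fill_missing(incomplete, complete_rows):
    """Fill None fields from the complete record nearest by Euclidean
    distance, measured over the fields that are present."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b) if x is not None))
    nearest = min(complete_rows, key=lambda r: dist(incomplete, r))
    # Copy missing fields from the most similar complete record.
    return [v if v is not None else n for v, n in zip(incomplete, nearest)]

complete = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
print(fill_missing([1.1, None, 3.2], complete))  # [1.1, 2.0, 3.2]
```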
Further, in the data integration of step A, the data of a plurality of data sets are merged into a unified database, and redundant attributes are filtered out through cross-matching and fusion of the data sets.
Further, the data conversion in step A includes constructing attributes for the integrated data set, reducing the attribute dimensionality of the data set, and performing integrity checking and supplementation on the reduced data set. Reducing the attribute dimensionality lowers memory and disk usage; if memory is insufficient or overflows during computation, PCA (Principal Component Analysis) is used to obtain data with low-dimensional attributes. Dimension reduction also speeds up machine learning. In many cases the data attributes need to be inspected, but high-dimensional attributes cannot be observed directly; reducing the attribute dimensionality to 2 or 3 (2D or 3D) makes the data easy to visualize.
Specifically, reducing the attribute dimensionality of the data set comprises:

Transforming the data set into matrix form and normalizing the attributes. The normalization gives all attributes a similar scale, which prevents large differences in attribute scale from distorting the dimension-reduction result, and is followed by a zero-mean normalization operation;
Calculating a covariance matrix of the data set attribute, and calculating an eigenvalue and an eigenvector of the covariance matrix through a singular value decomposition algorithm;
And obtaining a dimension reduction matrix through dimension reduction calculation, and mapping the data set to a low-dimensional space through the dimension reduction matrix.
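The three dimension-reduction steps can be sketched with NumPy (an assumption; the patent does not name a library). Centering, covariance, SVD and projection appear in the same order as in the text.

```python
import numpy as np

def pca_reduce(X, dims):
    """PCA via SVD: zero-mean normalise, decompose the covariance matrix,
    and map the data set into a `dims`-dimensional space."""
    Xc = X - X.mean(axis=0)            # zero-mean normalisation
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the attributes
    U, S, Vt = np.linalg.svd(cov)      # eigenvectors via singular value decomposition
    W = U[:, :dims]                    # dimension-reduction matrix
    return Xc @ W                      # project onto the low-dimensional space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # illustrative data: 100 records x 5 attributes
Y = pca_reduce(X, 2)
print(Y.shape)  # (100, 2)
```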
Specifically, the clustering process in step B includes:
B1. Setting the value range [K1, K2] of the cluster number K, where K1 and K2 bound the number of categories, and assigning K the lower limit K1 of the range;
B2. Randomly selecting K clustering centers;
B3. Assigning each record to the nearest cluster center by distance calculation;
B4. Recalculating a new cluster center;
B5. Dividing each obtained class into two subclasses by bisection, and comparing the BIC scores of each pair of parent class and subclasses. BIC is the Bayesian Information Criterion, used here to decide the number of cluster categories automatically. Assume a data set D and a family of candidate models M_j, where different models M_j correspond to different values of K. The BIC formula is:

BIC(M_j) = l_j(D) − (p_j / 2) · ln R

where R is the total number of data points in the data set D, l_j(D) is the log-likelihood of the data under the j-th model, evaluated at the maximum-likelihood estimate, and p_j is the number of parameters in model M_j. The principle is to judge the quality of a clustering result by its posterior probability.
B6. Calculating to obtain a pair of parent class and subclass with the largest difference of BIC scores;
B7. Keeping the split of step B6 (the parent replaced by its two subclasses) and leaving the other classes unchanged, so that the cluster number K becomes K + 1;
B8. Returning to step B2 and recomputing the category assignment in the new situation;
B9. If K reaches the upper limit K2, or two consecutive iterations give the same result, no cluster center needs to be split any more (step B5) and the loop ends;
B10. Calculating the Hamming distances among the records for the cluster categories in parallel, and reordering the data within each group.
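The split decision of steps B5 and B6 can be sketched with a BIC score of the form BIC(M_j) = l_j(D) − (p_j / 2) · ln R (the standard X-means form, which matches the variables named in the text); the log-likelihoods and parameter counts below are made-up numbers.

```python
import math

def bic_score(log_likelihood, num_params, num_points):
    """BIC(M_j) = l_j(D) - (p_j / 2) * ln(R); higher is better, trading
    goodness of fit against model complexity."""
    return log_likelihood - (num_params / 2.0) * math.log(num_points)

# A parent cluster versus its two-subclass split (illustrative numbers):
parent = bic_score(-120.0, num_params=3, num_points=50)
children = bic_score(-100.0, num_params=6, num_points=50)
print(children > parent)  # True: in this toy case the split improves the score
```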
The method for constructing a big data acceleration structure can quickly build an acceleration structure for big data association analysis. It overcomes the algorithm and platform limitations of existing methods: whether a traditional association analysis algorithm or one based on evolutionary theory is used, it significantly accelerates both the processing speed and the data loading speed of the model. The construction process is simple, and the method is suitable for many large-scale data scenarios, including police data.
The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.
Drawings
FIG. 1 is a flow chart of a method for constructing a big data acceleration structure according to the present invention.
Detailed Description
The method for constructing the big data acceleration structure of the invention as shown in FIG. 1 comprises the following steps:
A. Data preprocessing: perform data cleaning, data integration and data conversion on the raw data to form a data set suitable for subsequent operations.
Wherein the data cleaning includes deleting raw data, filling in missing values, and smoothing/filtering noisy data.
In deciding between deleting a record and filling in its missing values, the missing data are judged by the attribute weight coverage rate p of the raw record, where p is:

p = Σ_{A_i present} ω_i / (ω_1 + ω_2 + ... + ω_k)

where A_i is a data attribute, ω_i is its attribute weight, representing the importance of the attribute, and k is the number of attributes. If p ≥ Ω, the missing values in the record are filled in; otherwise the raw record is deleted;
The smoothing/filtering of noisy data uses a mean filtering method: for the current record to be processed a template is selected, the template being a number of records adjacent to the current record, and the original record values are replaced by the template mean, expressed per attribute as:

g(·) = (1/M) · Σ_{f ∈ s} f

where k represents the number of attributes (the mean is computed for each of them), g() represents the mean, M is the number of neighbours of the current record, s represents the neighbour set of the current record, and f represents one of the records in s.
The method for filling in a missing value is: measure the distance to the known complete raw records by the Euclidean distance, find the complete record most similar to the record with the missing value, and fill in the missing value according to the weights of the relevant fields of that complete record.
In the data integration, the data of multiple data sets are merged into a unified database, and redundant attributes are filtered out through cross-matching and fusion of the data sets.
The data conversion includes constructing attributes for the integrated data set, reducing the attribute dimensionality of the data set, and performing integrity checking and supplementation on the reduced data set. Reducing the attribute dimensionality of the data set comprises:

Transforming the data set into matrix form and normalizing the attributes. The normalization gives all attributes a similar scale, which prevents large differences in attribute scale from distorting the dimension-reduction result, and is followed by a zero-mean normalization operation;
Calculating a covariance matrix of the data set attribute, and calculating an eigenvalue and an eigenvector of the covariance matrix through a singular value decomposition algorithm;
And obtaining a dimension reduction matrix through dimension reduction calculation, and mapping the data set to a low-dimensional space through the dimension reduction matrix.
B. Clustering: cluster the preprocessed data; after clustering, compute the similarity between records within each category through the Hamming distance and reorder the records, so that according to the grouping result of the clustering algorithm the most similar records in a group have the minimum spatial distance. The clustering specifically comprises the following steps:
B1. Set the value range [K1, K2] of the cluster number K, where K1 and K2 bound the number of categories, and initialize K to the lower limit K1. The values of K1 and K2 are determined by the attributes of the particular service data.
B2. And randomly selecting K clustering centers.
B3. Each record is assigned to the nearest cluster center by distance calculation.
B4. The new cluster center is recalculated.
B5. Divide each obtained class into two subclasses by bisection, and compare the BIC scores of each pair of parent class and subclasses. BIC is the Bayesian Information Criterion, used here to decide the number of cluster categories automatically. Assume a data set D and a family of candidate models M_j, where different models M_j correspond to different values of K. The BIC formula is:

BIC(M_j) = l_j(D) − (p_j / 2) · ln R

where R is the total number of data points in the data set D, l_j(D) is the log-likelihood of the data under the j-th model, evaluated at the maximum-likelihood estimate, and p_j is the number of parameters in model M_j. The principle is to judge the quality of a clustering result by its posterior probability.
B6. And calculating to obtain a pair of parent class and subclass class with the largest difference of BIC scores.
B7. The split of step B6 is kept (the parent is replaced by its two subclasses) and the other classes remain unchanged, so that the cluster number K becomes K + 1.
B8. Return to step B2 and recompute the category assignment in the new situation.
B9. If K reaches the upper limit K2, or two consecutive iterations give the same result, no cluster center needs to be split any more (i.e. step B5 makes no change), and the loop ends.
B10. For the resulting cluster categories, compute the Hamming distances between records in parallel and reorder the data within each group.
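Step B10 can be sketched as a greedy reorder inside one cluster, assuming records are compared as equal-length strings under the Hamming distance; the greedy strategy is an illustrative assumption, since the text does not specify the reordering procedure.

```python
def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def reorder_by_hamming(records):
    """Greedy reorder inside one cluster: each record is placed next to the
    remaining record most similar to it (minimal Hamming distance)."""
    ordered = [records[0]]
    rest = list(records[1:])
    while rest:
        last = ordered[-1]
        nxt = min(rest, key=lambda r: hamming(last, r))
        rest.remove(nxt)
        ordered.append(nxt)
    return ordered

group = ["1010", "0000", "1011", "0001"]
print(reorder_by_hamming(group))  # ['1010', '1011', '0001', '0000']
```

Similar bit patterns end up adjacent, which is what gives the later run-length step long runs to compress.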
C. Inverted-index mapping structure: initialize the index file, the attribute weight index entries and the attribute index entries, and extract the attributes and the attribute weight list of the sorted data. Then construct an inverted index structure, establish the mapping relation among transaction attributes, transaction attribute weights and transaction records according to the three-level index, and repeat this process until all data are mapped, thereby building the three-level inverted index structure.
D. Run-length compression: initialize the compression index structure, the transaction attribute weight index and the transaction attributes; determine the range of consecutive records that share an attribute weight; traverse the inverted-index mapping structure; and compress the consecutive records under a shared transaction attribute weight with the run-length compression algorithm. The principle of the run-length compression algorithm, i.e. the RLE (run-length encoding) algorithm, is to replace a run of adjacent identical values in a data record row by two values (for example a run length and the repeated value, or a starting position and a run length); for instance, aaabccccccddeee can be encoded as 3a1b6c2d3e. The RLE compression method compresses such data effectively.
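Applied to the inverted index, the run-length step collapses consecutive record numbers that share one (attribute, weight) entry; the (start, run length) pair encoding below is a sketch consistent with the "starting position" wording, not the patent's exact layout.

```python
def compress_postings(record_numbers):
    """Collapse consecutive record numbers under one (attribute, weight)
    entry into (start, run_length) pairs."""
    runs = []
    for n in sorted(record_numbers):
        if runs and n == runs[-1][0] + runs[-1][1]:
            # n extends the current run of consecutive record numbers.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((n, 1))         # start a new run at n
    return runs

# Records 4..9 all share the same attribute weight; record 12 stands alone.
print(compress_postings([4, 5, 6, 7, 8, 9, 12]))  # [(4, 6), (12, 1)]
```

Because the earlier Hamming-distance reordering puts similar records next to each other, such posting lists tend to contain long consecutive runs, which is exactly what this encoding exploits.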
Claims (8)
1. The method for constructing the big data acceleration structure is characterized by comprising the following steps:
A. Data preprocessing: performing data cleaning, data integration and data conversion on the original data to form a data set conforming to the operation process;
B. Clustering treatment: clustering the preprocessed data, calculating the similarity among the records in the category through the Hamming distance after the clustering is finished, reordering the records, and calculating the record similarity in each group according to the grouping result of the clustering algorithm so as to minimize the spatial distance of the most similar record in the group;
C. The mapping structure of the inverted index: initializing an index file, an attribute weight index item and an attribute index item, extracting attributes of the sorted data and an attribute weight list, then constructing an inverted index structure, establishing a mapping relation among the transaction attributes, the transaction attribute weights and the transaction records according to a three-level index, and circularly performing the process until all data are mapped;
D. Run-length compression: initializing a compression index structure, a transaction attribute weight index and a transaction attribute, determining the range of the shared attribute weight of the consecutive records, traversing the inverted index mapping structure, and compressing the consecutive records under the shared transaction attribute weight through a run-length compression algorithm.
2. The big data acceleration structure construction method according to claim 1, characterized by: the data cleaning described in step a includes deleting the original data, filling in missing values, and smoothing/filtering the noise data.
3. The big data acceleration structure construction method according to claim 2, characterized in that: deleting missing data or filling missing values in the missing data is judged according to the attribute weight coverage rate p of the original data, wherein the attribute weight coverage rate p is as follows:
p = Σ_{A_i present} ω_i / (ω_1 + ω_2 + ... + ω_k), where A_i is a data attribute, ω_i is its attribute weight, representing the importance of the attribute, and k is the number of attributes; if p ≥ Ω, the missing values in the missing data are filled in, otherwise the original record is deleted;
Smoothing/filtering the noise data employs mean filtering, represented per attribute as g(·) = (1/M) · Σ_{f ∈ s} f, where k represents the number of attributes (the mean is computed for each of them), g() represents the mean, M is the number of neighbours of the current record, s represents the neighbour set of the current record, and f represents one of the records in s.
4. The big data acceleration structure construction method according to claim 2, characterized in that: the method for filling in the missing value comprises the following steps: and measuring the data in the known complete original data through Euclidean distance, inquiring the complete data most similar to the missing value, and supplementing the missing value according to the weight of the relevant field of the complete data.
5. The big data acceleration structure construction method according to claim 1, characterized by: in the data integration in step A, the data in a plurality of data sets are combined into a unified database, and redundant attributes are filtered through data cross-matching and multi-data-set fusion of each data set.
6. The big data acceleration structure construction method according to claim 1, characterized by: the data conversion in the step A comprises the steps of carrying out attribute construction on the data set after data integration, reducing attribute dimensionality of the data set and carrying out integrity check and supplement on the data set after dimensionality reduction.
7. The big data acceleration structure construction method according to claim 6, characterized in that: the reducing the dataset attribute dimension comprises:
Transforming the data set into a matrix form and normalizing the attributes;
Calculating a covariance matrix of the data set attribute, and calculating an eigenvalue and an eigenvector of the covariance matrix through a singular value decomposition algorithm;
And obtaining a dimension reduction matrix through dimension reduction calculation, and mapping the data set to a low-dimensional space through the dimension reduction matrix.
8. The big data acceleration structure construction method according to claim 1, characterized by: the clustering process in the step B comprises the following steps:
B1. Setting the value range [K1, K2] of the cluster number K, where K1 and K2 bound the number of categories, and assigning K the lower limit K1 of the range;
B2. Randomly selecting K clustering centers;
B3. Assigning each record to the nearest cluster center by distance calculation;
B4. Recalculating a new cluster center;
B5. Dividing each obtained class into two subclasses by bisection, and comparing the BIC scores of each pair of parent class and subclasses;
B6. Calculating to obtain a pair of parent class and subclass with the largest difference of BIC scores;
B7. Keeping the split of step B6 and leaving the other classes unchanged, so that the cluster number K becomes K + 1;
B8. Returning to step B2 and recalculating the category assignment in the new situation;
B9. If K reaches the upper limit K2, or two consecutive iterations give the same result, no clustering center needs to be split and the loop ends;
B10. Calculating the Hamming distances among the records for the plurality of cluster categories in parallel, and reordering the data within each group.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710817537.0A CN107609105B (en) | 2017-09-12 | 2017-09-12 | Construction method of big data acceleration structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107609105A CN107609105A (en) | 2018-01-19 |
CN107609105B true CN107609105B (en) | 2020-07-28 |
Family
ID=61062868
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710817537.0A Active CN107609105B (en) | 2017-09-12 | 2017-09-12 | Construction method of big data acceleration structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107609105B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509531B (en) * | 2018-03-15 | 2021-08-20 | 昆明理工大学 | Spark platform-based uncertain data set frequent item mining method |
CN110580490A (en) * | 2018-06-11 | 2019-12-17 | 杭州海康威视数字技术股份有限公司 | Method, device and equipment for determining personnel behavior probability |
CN110378569A (en) * | 2019-06-19 | 2019-10-25 | 平安国际智慧城市科技股份有限公司 | Industrial relations chain building method, apparatus, equipment and storage medium |
CN111062751A (en) * | 2019-12-12 | 2020-04-24 | 镇江市第一人民医院 | Charging system and method based on automatic drug correlation consumable |
CN111898673A (en) * | 2020-07-29 | 2020-11-06 | 武汉大学 | Dissolved oxygen content prediction method based on EMD and LSTM |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103500208A (en) * | 2013-09-30 | 2014-01-08 | 中国科学院自动化研究所 | Deep layer data processing method and system combined with knowledge base |
CN104102680A (en) * | 2013-04-15 | 2014-10-15 | 肖瑞 | Coding indexing mode for time sequences |
CN104809242A (en) * | 2015-05-15 | 2015-07-29 | 成都睿峰科技有限公司 | Distributed-structure-based big data clustering method and device |
CN106484813A (en) * | 2016-09-23 | 2017-03-08 | 广东港鑫科技有限公司 | A kind of big data analysis system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2530012A (en) * | 2014-08-05 | 2016-03-16 | Illumina Cambridge Ltd | Methods and systems for data analysis and compression |
Also Published As
Publication number | Publication date |
---|---|
CN107609105A (en) | 2018-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609105B (en) | Construction method of big data acceleration structure | |
EP3709184B1 (en) | Sample set processing method and apparatus, and sample querying method and apparatus | |
CN107682319B (en) | Enhanced angle anomaly factor-based data flow anomaly detection and multi-verification method | |
WO2016101628A1 (en) | Data processing method and device in data modeling | |
CN104239553A (en) | Entity recognition method based on Map-Reduce framework | |
WO2023082641A1 (en) | Electronic archive generation method and apparatus, and terminal device and storage medium | |
CN114386466B (en) | Parallel hybrid clustering method for candidate signal mining in pulsar search | |
CN108280236A (en) | A kind of random forest visualization data analysing method based on LargeVis | |
CN114897097A (en) | Power consumer portrait method, device, equipment and medium | |
CN111062418A (en) | Non-parametric clustering algorithm and system based on minimum spanning tree | |
CN114359632A (en) | Point cloud target classification method based on improved PointNet + + neural network | |
CN113569920A (en) | Second neighbor anomaly detection method based on automatic coding | |
CN113255841A (en) | Clustering method, clustering device and computer readable storage medium | |
CN105279524A (en) | High-dimensional data clustering method based on unweighted hypergraph segmentation | |
CN113159326B (en) | Intelligent business decision method based on artificial intelligence | |
CN116595102B (en) | Big data management method and system for improving clustering algorithm | |
CN111339294B (en) | Customer data classification method and device and electronic equipment | |
CN117493998A (en) | Questionnaire investigation event intelligent classification management method and system based on big data | |
CN111488903A (en) | Decision tree feature selection method based on feature weight | |
CN115063630A (en) | Application of decoupling migration-based federated learning method in computer vision | |
CN108256058A (en) | A kind of big media neighbour's search method of real-time response based on miniature computing platform | |
CN108717444A (en) | A kind of big data clustering method and device based on distributed frame | |
CN115017988A (en) | Competitive clustering method for state anomaly diagnosis | |
CN108415958A (en) | The weight processing method and processing device of index weight VLAD features | |
CN108090514B (en) | Infrared image identification method based on two-stage density clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |