CN112131303A - Large-scale data lineage method based on neural network model - Google Patents

Large-scale data lineage method based on neural network model Download PDF

Info

Publication number
CN112131303A
Authority
CN
China
Prior art keywords
data
training
neural network
data set
lineage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010988710.5A
Other languages
Chinese (zh)
Inventor
李�杰
叶一舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010988710.5A priority Critical patent/CN112131303A/en
Publication of CN112131303A publication Critical patent/CN112131303A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a large-scale data lineage method based on a neural network model, comprising the following steps: (1) generating a network training set, which comprises array sorting, dimension standard division, and training-subset division: the data in the data set are sorted according to their values in each dimension; a division standard is determined for each dimension so that the training samples can be enumerated exhaustively; and the training set is divided into a number of smaller training subsets. (2) Training a neural network model: a hierarchical network structure replaces the traditional single neural network to reduce the errors caused by large differences among samples; the hierarchy consists of a network selector and subnets. (3) Visual interaction and lineage: the results are presented in a spatial scatter plot, a spatio-temporal projection view, and a pattern comparison view, supporting interactive visual exploration of the data set so that users can explore data results and data sources in an intuitive way.

Description

Large-scale data lineage method based on neural network model
Technical Field
This patent relates to the fields of machine learning and data visualization, and in particular to a method for real-time interaction with, and neural network model optimization for, large-scale data sets.
Background
In recent years, researchers have been confronted with data sets whose size grows exponentially [4], which undoubtedly complicates interactive visual exploration and lineage. Recently proposed techniques allow analysts to interactively explore large-scale data sets in real time [5], but these techniques ignore the real data, which users may care about, hidden behind the statistical distributions [10]. We realize the reverse generation of data from a visualization, so a visual view is no longer limited to displaying statistics of the data: it can itself be used as a query to generate more complex visual views, or to explore the detailed distribution of data within a subset of a view.
Research on data lineage has been conducted in the database field for some time [7]. Traditional approaches capture provenance information by extending the underlying data model [9], with an evident disadvantage: the provenance must be stored using a model different from that of the actual data. Miles et al. [8] study how the provenance of results, i.e., the otherwise hidden details describing how results were produced, can help scientists perform experiments. Boris Glavic et al. [6] propose a query-rewriting method that tags result tuples with their source tuples and demonstrate its feasibility in a database. Dursun et al. [1] propose a new intermediate-reuse model that caches internal physical data structures materialized during query processing; this work speeds up analytical queries by reusing intermediates in the database. Panda, by Ikeda et al. [2], implements provenance capture, storage, operators, and queries, and applies data lineage to tasks such as debugging, auditing, data integration, security, iterative analysis, and cleanup. On this basis, Fotis Psallidas et al. [3] propose Smoke, an in-memory database engine that captures lineage at low overhead: Smoke stores lineage conditions in hash tables in advance, saving the time overhead of lineage queries and meeting the requirements of real-time visual interaction.
The above work mainly targets larger-scale data sets, but it has some drawbacks. First, some systems create a hash index for each input to speed up lineage queries; as the data size increases, however, the hash table grows with it, which can exhaust memory. Second, recent work builds hash tables in memory on the fly to speed up queries, but even with this optimization, generating hash tables in real time still incurs storage overhead and additional query time. Moreover, the above work cannot regenerate visualizations from query results; it can only establish connections between multiple visualization views.
Disclosure of Invention
An object of the present invention is to solve the following problems in the prior art. 1. A neural network model replaces the traditional index structure, reducing the time and storage overhead of queries. 2. For large amounts of data, a single neural network cannot model the relationship between queries and indexes well, so a hierarchical structure is needed: a first-layer network selector finds the subnet corresponding to a query, and a second-layer subnet computes and outputs the query result. 3. A large-scale data set often has many dimensions, and a user may need to constrain several of them at once; therefore, a separate division standard is formulated, and a neural network model is trained, for each dimension, so as to support lineage queries with simultaneous multi-dimensional constraints. Accordingly, the invention provides a neural-network-model-based framework for the exploration of large-scale data sets. First, the framework adopts an index structure based on a neural network model, meeting the requirement of real-time interactive lineage queries. Second, the framework combines a hierarchical network model with a hash table to handle erroneous data. Finally, a visual interface supporting fast query and interaction on the data results is designed.
The purpose of the invention is realized by the following technical scheme:
the large-scale data lineage method based on the neural network model comprises the following steps:
(1) Generating a network training set, comprising array sorting, dimension standard division, and training-subset division: the data in the data set are sorted and stored according to their values in each dimension; a division standard is determined for each dimension so that the training samples can be enumerated exhaustively; and the training set is divided into a number of subsets used as training subsets;
(2) Training a neural network model: a hierarchical network structure replaces the traditional single neural network to reduce the errors caused by large differences among samples; the hierarchical network structure comprises a network selector and subnets, where the network selector, as the first layer, finds the correct subnet for a query, and each subnet is trained separately on its training subset;
(3) Visual interaction and lineage: the outputs of the neural network are mapped into several views, specifically a spatial scatter plot, a spatio-temporal projection view, and a pattern comparison view; interactive visual exploration of the data set is supported so that users can explore data results and data sources in an intuitive way.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. Visualization-based lineage. This is a new query method that starts from a visualization to find data, focusing on how to find the real data behind a region of interest in a visual view. Current methods cannot support real-time interactive search of the real data, especially details within the data. The retrieved data can then be used to generate new visualization views to assist the user in further analysis and research.
2. A hierarchical neural network structure. The structure effectively reduces the range of values each neural network must predict and helps control the number of neurons; with few neurons, each network is easy to observe and adjust, which also solves the network-update problem. The structure effectively controls errors: each neural network only needs to fit the relationship between thousands of data samples and labels, so the maximum allowable error can simply be kept at a reasonable value.
3. A neural-network-based index. The method uses a composite structure of hash tables and hierarchical neural networks to support lineage queries over large data at interactive rates. It costs little additional storage and enables fine-grained lineage queries. The structure supports real-time interactive exploration with storage costs lower than existing techniques, and it supports updates, solving a common problem of neural-network-based structures.
Drawings
Fig. 1 is a general flow chart of the proposed method.
FIG. 2 is a diagram of network training set generation.
Fig. 3 is a diagram of the neural network architecture. In the figure: a represents the query input by the user, b represents the corresponding subnet found for the query by the network selector, and c represents the position of the data corresponding to the query, as output by that subnet.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a neural-network-model-based framework for the exploration of large-scale data sets. First, the framework adopts an index structure based on a neural network model: for each dimension in the data set, all query conditions are used as samples and the query results as labels, and a neural network model is trained to meet real-time interactive lineage queries. Second, the hierarchical network model and the hash table effectively reduce the range of values each neural network needs to predict by dividing the network into two layers, a network selector and subnets, while data with larger errors are recorded in a hash table, thereby handling erroneous data. Specifically, as shown in fig. 1, the method mainly comprises the following steps:
the method comprises the following steps: network training set generation (fig. 2). The specific operations comprise array sequencing, dimension standard division and training subset division. Sorting the data in the data set according to values in different dimensions in the data set, and then storing all the sequences; determining a division standard for each dimension to solve the sample exhaustion problem; the training samples are divided into many smaller subsets to overcome the small errors that neural networks do not adapt well.
Array sorting is a very important component of the data structure. In short, it means sorting the data in the data set according to their values in each dimension and then storing all these orders. Training a neural network requires some functional relationship between inputs and outputs, and sorting establishes an approximately linear relationship between queries and outputs. For each of several dimensions in the data set (longitude, latitude, hour, day), a sorted array is stored, ordered by the value of each record in that dimension. As in a database, each record is assigned a unique primary key, so each sorted array only needs to store primary keys, which greatly reduces storage overhead. Typically, each entry only incurs the smallest integer storage cost, i.e., 4 bytes.
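As a minimal sketch of the array-sorting step described above (the record schema, dimension names, and function name are illustrative assumptions; the patent gives no concrete code):

```python
# Build, for each dimension, a sorted array of primary keys.
# Records map a primary key to its per-dimension values; the dimension
# names here are examples taken from the text (longitude, hour).

def build_sorted_arrays(records, dimensions):
    """records: dict mapping primary key -> dict of dimension values.
    Returns dict mapping each dimension -> list of primary keys sorted
    by that dimension's value (each stored entry is just a small key)."""
    sorted_arrays = {}
    for dim in dimensions:
        sorted_arrays[dim] = sorted(records, key=lambda pk: records[pk][dim])
    return sorted_arrays

records = {
    0: {"longitude": 117.2, "hour": 10},
    1: {"longitude": 116.4, "hour": 8},
    2: {"longitude": 121.5, "hour": 23},
}
arrays = build_sorted_arrays(records, ["longitude", "hour"])
# arrays["longitude"] == [1, 0, 2] and arrays["hour"] == [1, 0, 2]
```

Only the keys are stored per array, matching the 4-bytes-per-entry accounting in the text.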
Training samples are generated separately for each dimension. To make the training samples enumerable, a division standard must be determined for each dimension, i.e., the samples are generated on this standard. For dimensions with a clear natural unit (e.g., day), one unit is taken as the division standard; for dimensions without one (e.g., longitude or latitude), the dimension is divided as finely as possible. Each training sample consists of an input and an output (label). The input is a query condition: the value increases from zero by one resolution step until the maximum of the dimension is reached. The output is the query result corresponding to each input, namely the index, in the sorted array of that dimension, of the last record whose value is less than the input. In this case the inputs and outputs increase monotonically, so an approximately linear relationship may exist between them. Because neural networks do not adapt well to subtle errors, and this problem is magnified as the range of the training samples grows, the error would eventually become unacceptable; the training samples are therefore divided into many smaller subsets, each used to train a different neural network. For each subset, normalization is applied to simplify training.
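Under the same illustrative assumptions (names and resolution are made-up, not the patent's), generating the (input, label) pairs for one dimension might look like this, where the label is the index of the last value strictly less than the query threshold:

```python
import bisect

def make_training_samples(sorted_values, step, max_value):
    """sorted_values: the values of one dimension, in sorted order.
    Returns (inputs, labels): for each query threshold q, the label is
    the index of the last value strictly less than q (-1 if none)."""
    inputs, labels = [], []
    q = 0.0
    while q <= max_value:
        # bisect_left counts values < q; the last such index is count - 1
        labels.append(bisect.bisect_left(sorted_values, q) - 1)
        inputs.append(q)
        q += step
    return inputs, labels

values = [8, 10, 23]            # e.g. the "hour" dimension, pre-sorted
xs, ys = make_training_samples(values, step=6, max_value=24)
# xs == [0.0, 6.0, 12.0, 18.0, 24.0]; ys == [-1, -1, 1, 1, 2]
```

Both sequences increase monotonically, which is the near-linear input/output relationship the text relies on.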
Step two: training the neural network model (fig. 3). A hierarchical network structure replaces the traditional single neural network to reduce the errors caused by large differences among samples. The hierarchy comprises a network selector and subnets: the network selector, as the first layer, finds the correct subnet for a query, and each subnet is trained separately on its training subset.
Given the training samples, the conventional approach would be to train one neural network on all of them and save it. In this embodiment, however, the slight deviations of the neural network are amplified many times by the large sample range, making the error unacceptable. A hierarchical structure is therefore used instead of a single traditional neural network. The hierarchy consists of two parts: a network selector and subnets. The network selector serves as the first layer; its role is to find the correct subnet for a query. Because the training samples are distributed evenly into fixed-size groups, the subnet a query belongs to can be determined with a single calculation. The query is then normalized and fed into that subnet, which outputs the index of the corresponding data in a sorted array. Note that the output index is not an exactly accurate value: by the nature of a neural network, it cannot fit every record perfectly unless the number of neurons is very large. An error bound is therefore set for the output, and a prediction is considered successful as long as the output falls within this bound. The benefits of using a hierarchical structure rather than a single neural network are as follows:
(1) It effectively reduces the range of values each neural network needs to predict. For a large data set, the relationship between query and index is roughly linear overall, but magnifying a small region reveals more and more irregularity. If a single neural network only needs to satisfy thousands of relationships between queries and indexes, however, it fits them well (rare exceptions aside).
(2) It helps control the number of neurons. For a training set with tens of millions of samples, it is difficult to determine the number of neurons of a single network; for a training set of only thousands of samples, a few neurons achieve very good results. In addition, networks with fewer neurons split one long training run into separate small ones, which is easier to observe and adjust.
(3) It helps set the error bound. If a single neural network must fit the relationship between millions or tens of millions of samples and labels, it is difficult to control the allowable error between predicted and true values; by contrast, if each network only fits thousands of samples, the maximum allowable error can simply be kept at a reasonable value.
Even after the data are divided into many small parts, there remains a very small amount of data for which the neural network cannot keep the error between predicted and true values within the maximum allowable range. These records, "abnormal" from the network's point of view, are stored in a hash table. Because they are few and the time complexity of a hash lookup is only O(1), the time and storage overhead of the hash table is almost negligible. The final index structure therefore consists of the pre-sorted data, the adaptive single-layer neural networks, and the supplementary hash tables. At query time, in each dimension, the output of the neural network or the hash table locates the position of the query in the original data; the per-dimension results are then intersected according to the conditions supplied by the user, and the combined result is passed to the front-end interface for visual presentation.
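A schematic of the two-layer lookup with a hash-table fallback can be sketched as follows. All class, method, and variable names are illustrative assumptions, and trivial linear functions stand in for the trained subnets; the patent describes the mechanism but provides no code:

```python
# Hierarchical index: a selector routes each query to one of many small
# subnets; queries the subnets mispredict beyond the error bound are
# stored in a hash table and answered in O(1) instead.

class HierarchicalIndex:
    def __init__(self, boundaries, subnets, max_error):
        self.boundaries = boundaries  # subnets evenly cover [boundaries[0], boundaries[-1])
        self.subnets = subnets        # each maps a normalized query -> predicted index
        self.max_error = max_error    # bound used offline when filling self.exceptions
        self.exceptions = {}          # hash table for "abnormal" records

    def _select(self, q):
        # Network selector: equal-width ranges, so one calculation suffices.
        lo, hi = self.boundaries[0], self.boundaries[-1]
        width = (hi - lo) / len(self.subnets)
        i = min(int((q - lo) / width), len(self.subnets) - 1)
        return i, lo + i * width, width

    def lookup(self, q):
        if q in self.exceptions:           # O(1) hash-table fallback
            return self.exceptions[q]
        i, start, width = self._select(q)
        normalized = (q - start) / width   # normalize into [0, 1) for the subnet
        return round(self.subnets[i](normalized))

# Toy example: two "subnets" (linear stand-ins for trained networks)
# over the query range [0, 100), plus one exceptional record.
idx = HierarchicalIndex(
    boundaries=[0.0, 100.0],
    subnets=[lambda x: x * 50, lambda x: 50 + x * 50],
    max_error=2,
)
idx.exceptions[73.0] = 999
# idx.lookup(25.0) == 25; idx.lookup(75.0) == 75; idx.lookup(73.0) == 999
```

The index returned by `lookup` is then used as a position in the pre-sorted array of the queried dimension, as described above.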
Step three: visual interaction and lineage. Interactive visual lineage is provided for the data set, specifically through a spatial scatter plot, a spatio-temporal projection view, and a pattern comparison view. These views support interactive visual exploration so that users can explore data results and data sources in an intuitive way.
Map spatial scatter plot.
The visualization interface reflects the spatial distribution of the data on a map through a scatter plot. The user can judge the amount of data in an area from the density of points on the map. Meanwhile, the scatter plot can constrain the other four visual views: the user may draw a selection box on the scatter plot to explore and analyze data in a local area rather than on the entire map. The selection box is used as a query; the lineage result of the query is returned and stored in memory, and the other four views are updated by on-the-fly computation. The scatter plot supports zooming, so a user can inspect individual records in a very small area (e.g., a street containing only a few records). The scatter plot also responds to the user's actions in the other four views, displaying the spatial distribution of the data for a particular period of interest.
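A box selection as described above amounts to a pair of range constraints per dimension. A minimal sketch of turning a selection box into per-dimension index ranges (function and field names are illustrative; a binary search stands in here for the trained subnet of the previous step):

```python
import bisect

def box_to_ranges(box, sorted_values):
    """box: dict mapping dimension -> (lo, hi) constraint.
    sorted_values: dimension -> sorted list of that dimension's values.
    Returns dimension -> (start, end) index range into the sorted array;
    intersecting the key sets of these ranges yields the lineage result."""
    ranges = {}
    for dim, (lo, hi) in box.items():
        start = bisect.bisect_left(sorted_values[dim], lo)
        end = bisect.bisect_right(sorted_values[dim], hi)
        ranges[dim] = (start, end)
    return ranges

sel = box_to_ranges(
    {"longitude": (116.0, 118.0)},
    {"longitude": [115.9, 116.4, 117.2, 121.5]},
)
# sel == {"longitude": (1, 3)}
```

The per-dimension ranges are then intersected, matching the conjunction operation described in step two.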
View components.
The visualization interface uses line charts, bar charts, and heat maps to reflect the temporal distribution of the data. A user performs a lineage query by constraining the scatter plot; the query results are computed in real time and shown as heat maps, line charts, and bar charts, so the temporal distribution of local data can easily be explored and analyzed. If the user sets no constraint, the views reflect the entire data set. Except on the heat map, the user may draw selection boxes in the three view components to constrain the data reflected by all other views (including the scatter plot), and may combine view components according to the time period of interest. For example, to analyze weekend data, the user draws a box on the week projection histogram, while to analyze night-time data, the user draws a box on the hour histogram. The hour projection histogram reflects the data distribution over 24 hours, the week projection histogram over the 7 days of the week, and the day distribution histogram between a start date and an end date. The day distribution histogram additionally carries an abstract view, shown as a thin line chart reflecting the distribution of the entire data set over time; the user can select a time range of interest on the abstract view, and the day distribution histogram then visualizes the query result as a detailed view.
Resource occupation.
The storage overhead of the data structure comes mainly from the neural networks and the pre-sorting; details are shown in table 1. The space occupied by pre-sorting is proportional to the number of lineage-query dimensions: each dimension causes the structure to store one more sorted array with as many entries as the raw data, although each entry is only the minimal 4 bytes. The other part of the overhead comes from the parameters of the neural networks, which depend on the number of networks and the number of neurons per network. The number of networks is determined by the resolution of each dimension. In the time dimension, for example, the resolution of lineage queries is one hour, so two records with timestamps 10h15m and 10h20m fall under the same label, 10h. The finer the resolution, the larger the number of networks or the number of neurons per network. The implementation tends to increase the number of networks, since the data one network can fit well must show some regularity and its range should not exceed a certain bound.
During training, accuracy can be improved effectively by increasing the number of neural networks and reducing the amount of data each network must fit. The storage occupied by the networks then stays constant, because the resolution does not change when new data are inserted, and the structure can be updated quickly by adjusting the index values at the corresponding resolution: for each dimension of a new record, one index value is added, corresponding to its position in the sorted array of that dimension, again 4 bytes in size. The first three columns of table 1 report the number of input objects, the training time, and the storage overhead. The number of input objects is the number of valid records in the data set; the training time is the total time spent training all networks that need training. The network column gives the number of subnets used per data set, and the object (network) column the number of records per subnet. Because the data structure discards some information from the original data, the storage overhead is not necessarily larger than the data set itself.
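As a back-of-the-envelope check of the pre-sorting overhead described above (the record count and dimension count below are made-up examples, not figures from table 1):

```python
# Pre-sorting stores one 4-byte primary key per record per queried dimension.
BYTES_PER_KEY = 4

def presort_overhead_bytes(num_records, num_dimensions):
    """Total size of all sorted arrays, in bytes."""
    return num_records * num_dimensions * BYTES_PER_KEY

# e.g. 10 million records queried over 4 dimensions (longitude,
# latitude, hour, day) -> 160,000,000 bytes of sorted arrays.
overhead = presort_overhead_bytes(10_000_000, 4)
```

Inserting one new record adds only one 4-byte index value per dimension, which is why updates are cheap.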
The data structure can handle very fine-grained data lineage, such as lineage in a single dimension with a precision of 1.0, and it retains good performance with low time and storage overhead when processing large-scale data lineage. For a data set containing millions or tens of millions of records, finding just a few of them is difficult; the data structure achieves fine-grained visual lineage queries on such large-scale data sets with good performance.
TABLE 1 resource occupancy
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
Reference documents:
[1] K. Dursun, C. Binnig, U. Cetintemel, and T. Kraska. Revisiting reuse in main memory database systems. arXiv preprint arXiv:1608.05678, 2016.
[2] R. Ikeda and J. Widom. Panda: A system for provenance and data. IEEE Data Eng. Bull., 2010.
[3] F. Psallidas and E. Wu. Smoke: Fine-grained lineage capture at interactive speed. Proceedings of the VLDB Endowment, 2018.
[4] F. Psallidas and E. Wu. Provenance for interactive visualizations. HILDA, 2018.
[5] J. Poco and J. Heer. Reverse-engineering visualizations: Recovering visual encodings from chart images. Comput. Graph. Forum, 2017.
[6] B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. ICDE, 2009.
[7] K.-K. Muniswamy-Reddy, P. Macko, and M. Seltzer. Provenance for the cloud. Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST), 2010.
[8] S. Miles, P. Groth, E. Deelman, K. Vahi, G. Mehta, and L. Moreau. Provenance: The bridge between experiments and data. Comput. Sci. Engin., 2008.
[9] B. Glavic. Big data provenance: Challenges and implications for benchmarking. Specifying Big Data Benchmarks, First Workshop, WBDB, 2014.
[10] J. Wang, D. Crawl, S. Purawat, M. Nguyen, and I. Altintas. Big data provenance: Challenges, state of the art and opportunities. Big Data, 2015.

Claims (1)

1. The large-scale data lineage method based on the neural network model is characterized by comprising the following steps of:
(1) Generating a network training set, comprising array sorting, dimension standard division, and training-subset division: the data in the data set are sorted and stored according to their values in each dimension; a division standard is determined for each dimension so that the training samples can be enumerated exhaustively; and the training set is divided into a number of subsets used as training subsets;
(2) Training a neural network model: a hierarchical network structure replaces the traditional single neural network to reduce the errors caused by large differences among samples; the hierarchical network structure comprises a network selector and subnets, where the network selector, as the first layer, finds the correct subnet for a query, and each subnet is trained separately on its training subset;
(3) Visual interaction and lineage: the outputs of the neural network are mapped into several views, specifically a spatial scatter plot, a spatio-temporal projection view, and a pattern comparison view; interactive visual exploration of the data set is supported so that users can explore data results and data sources in an intuitive way.
CN202010988710.5A 2020-09-18 2020-09-18 Large-scale data lineage method based on neural network model Pending CN112131303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010988710.5A CN112131303A (en) 2020-09-18 2020-09-18 Large-scale data lineage method based on neural network model


Publications (1)

Publication Number Publication Date
CN112131303A true CN112131303A (en) 2020-12-25

Family

ID=73843029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010988710.5A Pending CN112131303A (en) 2020-09-18 2020-09-18 Large-scale data lineage method based on neural network model

Country Status (1)

Country Link
CN (1) CN112131303A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022160335A1 (en) * 2021-02-01 2022-08-04 Paypal, Inc. Graphical user interface to depict data lineage information in levels

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826166A * 2010-04-27 2010-09-08 青岛大学 Novel neural network pattern recognition method
CN106796513A * 2014-07-18 2017-05-31 起元科技有限公司 Managing lineage information
CN107810500A (en) * 2015-06-12 2018-03-16 起元技术有限责任公司 Data quality analysis
CN108073686A (en) * 2016-11-18 2018-05-25 埃森哲环球解决方案有限公司 Closed loop unified metadata framework with versatile metadata repository
CN110929870A (en) * 2020-02-17 2020-03-27 支付宝(杭州)信息技术有限公司 Method, device and system for training neural network model
CN111105332A (en) * 2019-12-19 2020-05-05 河北工业大学 Highway intelligent pre-maintenance method and system based on artificial neural network


Similar Documents

Publication Publication Date Title
Qin et al. Making data visualization more efficient and effective: a survey
Lopez et al. Spatiotemporal aggregate computation: A survey
US20230342358A1 (en) Optimized data structures of a relational cache with a learning capability for accelerating query execution by a data system
Li et al. Trajmesa: A distributed nosql storage engine for big trajectory data
US10740396B2 (en) Representing enterprise data in a knowledge graph
Ceci et al. Effectively and efficiently supporting roll-up and drill-down OLAP operations over continuous dimensions via hierarchical clustering
US9910860B2 (en) Split elimination in MapReduce systems
US7725471B2 (en) Method and apparatus for generating and utilizing qualifiers and qualified taxonomy tables
EP3014488B1 (en) Incremental maintenance of range-partitioned statistics for query optimization
US20040111410A1 (en) Information reservoir
US8788464B1 (en) Fast ingest, archive and retrieval systems, method and computer programs
CN103970902A (en) Method and system for reliable and instant retrieval on situation of large quantities of data
US20080154994A1 (en) Managing aged index data for a database
US20190317938A1 (en) Method, program, and system for automatic discovery of relationship between fields in environment where different types of data sources coexist
US11755284B2 (en) Methods and systems for improved data retrieval and sorting
CN113535788A (en) Retrieval method, system, equipment and medium for marine environment data
Venkatesh et al. Challenges and research disputes and tools in big data analytics
Manghi et al. Entity deduplication in big data graphs for scholarly communication
CN112131303A (en) Large-scale data lineage method based on neural network model
Dorkenwald et al. CAVE: Connectome annotation versioning engine
Cuzzocrea Scalable olap-based big data analytics over cloud infrastructures: Models, issues, algorithms
Jensen et al. Time series management systems: a 2022 survey
Golab et al. Exploring data using patterns: A survey and open problems
CN114880393A (en) Massive space-time data visualization performance optimization method and system based on multidimensional index
Sumalatha et al. Efficient data retrieval using adaptive clustered indexing for continuous queries over streaming data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201225