Disclosure of Invention
The invention aims to provide a parallel coordinate improvement method for a multidimensional integer value type data set or a data set containing integer value type data dimensions, which comprises the following steps:
Step 1: Count the number of distinct data values in each integer data dimension of the data set, and calculate the record ratio of each data value.
For one integer data dimension (denoted $D_i$), the calculation proceeds as follows:
Step 1.1: Extract the data of dimension $D_i$ as a vector (denoted $V_i$). If the data set contains $T$ data records, then $V_i$ has $T$ components.
Step 1.2: Count the number of distinct data values in $V_i$ (denoted $NV_i$).
Step 1.3: Count the number of records for each data value in $V_i$ and sort the data values in descending order of record count. Following this order, convert the data values in $V_i$ to the integers 1 to $NV_i$. The data value whose conversion value is $j$ is named $V_{ij}$, and the number of records in data dimension $D_i$ satisfying $V_i = V_{ij}$ is named $NV_{ij}$.
In the invention, the value to which each data value in $V_i$ is converted is called its "conversion value"; conversion values range from 1 to $NV_i$.
Step 1.4: Calculate the record ratio of each data value in $V_i$. The record ratio $R_{ij}$ of the records satisfying $V_i = V_{ij}$ is calculated as in formula (1):
$$R_{ij} = \frac{NV_{ij}}{T} \tag{1}$$
where $T$ is the total number of records in the data set described in step 1.1.
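Step 1 can be sketched in Python as follows. This is only an illustrative sketch: the function name `convert_dimension` is hypothetical, and the rule for breaking ties between data values with equal record counts (smaller value first) is an assumption, since the text does not specify one.

```python
from collections import Counter

def convert_dimension(values):
    """Return (conversion, counts, ratios) for one integer data dimension.

    conversion: original data value -> conversion value 1..NV_i
    counts:     conversion value j  -> NV_ij (number of records)
    ratios:     conversion value j  -> R_ij = NV_ij / T
    """
    total = len(values)                       # T, step 1.1
    tally = Counter(values)                   # record count of each data value
    # Sort by record count, descending (step 1.3). Ties are broken by the
    # data value itself, which is an assumption made for determinism.
    ordered = sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))
    conversion = {value: j for j, (value, _) in enumerate(ordered, start=1)}
    counts = {j: n for j, (_, n) in enumerate(ordered, start=1)}
    ratios = {j: n / total for j, n in counts.items()}   # step 1.4
    return conversion, counts, ratios

# Small made-up vector, not taken from the example data set.
conversion, counts, ratios = convert_dimension([2, 2, 11, 3, 2, 11])
```

Here the value 2 occurs most often, so it receives conversion value 1, followed by 11 and then 3.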
Step 2: Establish the coordinate axes according to the data distribution of all integer data dimensions in the data set. If the data set also contains non-integer data dimensions, the coordinate axes corresponding to those dimensions are established in the traditional way, unchanged.
For an integer data dimension $D_i$, the corresponding coordinate axis is established as follows:
Divide the coordinate axis into $NV_i$ segments, called "coordinate axis segments", each corresponding to one data value of data dimension $D_i$. The height of each coordinate axis segment is proportional to the record ratio of the corresponding data value. A coordinate axis established in this way for an integer data dimension is composed of the coordinate axis segments of its different data values, and is called a "segmented coordinate axis" in the invention.
The information for each segment of the coordinate axis is calculated as follows:
Step 2.1: Calculate the height of the coordinate axis segment corresponding to each data value from the height of the coordinate axis in the parallel coordinate system of the final visualization result (denoted $height$).
Taking integer data dimension $D_i$ as an example, the height of the coordinate axis segment corresponding to data value $V_{ij}$ is calculated as in formula (2):
$$H_{ij} = height \times R_{ij} \tag{2}$$
where $R_{ij}$ is the record ratio of the records satisfying $V_i = V_{ij}$, obtained in step 1.4.
Step 2.2: Calculate the starting height and ending height of each coordinate axis segment. Taking the coordinate axis segment corresponding to data value $V_{ij}$ as an example, the starting height is calculated as in formula (3):
$$Hstart_{ij} = \begin{cases} 0, & j = 1 \\ Hend_{i(j-1)}, & j > 1 \end{cases} \tag{3}$$
and the ending height is calculated as in formula (4):
$$Hend_{ij} = Hstart_{ij} + H_{ij} \tag{4}$$
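The segment layout of step 2 can be sketched as below. The sketch assumes that the first segment starts at height 0 and that every later segment starts where the previous one ends; the function name `segment_axis` and the dictionary keys are hypothetical.

```python
def segment_axis(ratios, height):
    """Lay out the coordinate axis segments of one integer data dimension.

    ratios: list of record ratios R_ij, ordered by conversion value j = 1..NV_i
    height: height of the coordinate axis in the final visualization
    """
    segments = []
    start = 0.0                       # assumed starting height of segment j = 1
    for r in ratios:
        h = height * r                # segment height, formula (2)
        segments.append({"height": h, "start": start, "end": start + h})
        start += h                    # the next segment starts where this one ends
    return segments

# Made-up ratios for a dimension with three data values.
segments = segment_axis([0.5, 0.3, 0.2], height=100)
```

Because the ratios sum to 1, the last segment ends at the full axis height.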
Step 3: For each coordinate axis segment of the coordinate axes corresponding to all integer data dimensions, calculate the offset height of one data record. Taking the coordinate axis segment corresponding to data value $V_{ij}$ as an example, the offset height $I_{ij}$ of one record is calculated as in formula (5):
$$I_{ij} = \frac{H_{ij}}{NV_{ij}} \tag{5}$$
where $H_{ij}$ is the height of the coordinate axis segment corresponding to data value $V_{ij}$, obtained in step 2.1, and $NV_{ij}$ is the number of records satisfying $V_i = V_{ij}$, obtained in step 1.3.
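A short derivation may help here. Assuming the record ratio is the record count divided by $T$ and the offset height is the segment height shared evenly among the records of that segment, the offset height comes out the same for every segment:

```latex
I_{ij} = \frac{H_{ij}}{NV_{ij}}
       = \frac{height \times NV_{ij} / T}{NV_{ij}}
       = \frac{height}{T}
```

Under these assumptions each record occupies an equal slice of the axis, and a segment with more records is taller only because it stacks more slices.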
Step 4: Calculate the data value mapping basic data of all coordinate axes corresponding to integer data dimensions, according to the adjacency of the coordinate axes established in step 3.
The invention maps the same data value in different records of an integer data dimension to different heights, which solves the problem of same-point mapping and effectively reduces crossings of the connecting lines. In the invention this mapping method is named the "offset mapping" method.
In the offset mapping method, the mapping height of a data value depends on two factors. The first is the position of the record in the data set. The second is the mapping height of the record's data value on the left adjacent coordinate axis (this factor is not considered when the coordinate axis of the current data dimension is the leftmost coordinate axis).
The specific mapping method distinguishes two cases. If the data dimension corresponding to the left adjacent coordinate axis is an integer data dimension, continue with step 5. If the data dimension corresponding to the left adjacent coordinate axis is a non-integer data dimension, or the coordinate axis of the current data dimension is the leftmost coordinate axis, jump to step 6.
Step 5: This step calculates the data value mapping basic data for the case where the data dimension corresponding to the left adjacent coordinate axis is an integer data dimension.
For the current coordinate axis, the data value mapping basic data is calculated as follows:
Step 5.1: Let $D_u$ be the integer data dimension corresponding to the coordinate axis adjacent on the left of the coordinate axis of the current integer data dimension $D_i$, let $V_u$ be the vector extracted from $D_u$, and let $NV_u$ be the number of distinct data values in $V_u$ ($NV_u$ is calculated according to step 1.2).
Step 5.2: Count the joint record numbers of the integer data dimensions $D_u$ and $D_i$. That is, for any conversion value $p$ of $V_u$ and any conversion value $q$ of $V_i$, count the number of records satisfying $V_u = V_{up}$ and $V_i = V_{iq}$, named $NV_{up,iq}$, where $V_{up}$ is the data value of $V_u$ with conversion value $p$ and $V_{iq}$ is the data value of $V_i$ with conversion value $q$.
Step 5.3: According to the number $NV_u$ of distinct data values in $V_u$, divide the coordinate axis segment corresponding to data value $V_{iq}$ on the current coordinate axis into $NV_u$ "coordinate axis sub-segments".
Step 5.4: Calculate the heights of all coordinate axis sub-segments on the current coordinate axis.
Taking $V_u = V_{up}$ and $V_i = V_{iq}$ as an example, the height $H_{up,iq}$ of the corresponding coordinate axis sub-segment is calculated as in formula (6):
$$H_{up,iq} = H_{iq} \times \frac{NV_{up,iq}}{NV_{iq}} \tag{6}$$
where $NV_{iq}$ is the number of records satisfying $V_i = V_{iq}$, obtained in step 1.3; $H_{iq}$ is the height of the coordinate axis segment corresponding to $V_i = V_{iq}$, obtained in step 2.1; and $NV_{up,iq}$ is the number of records satisfying $V_u = V_{up}$ and $V_i = V_{iq}$, obtained in step 5.2.
Step 5.5: Calculate the starting heights of all coordinate axis sub-segments on the current coordinate axis.
Taking $V_u = V_{up}$ and $V_i = V_{iq}$ as an example, the starting height of the corresponding coordinate axis sub-segment, denoted $Hstart_{up,iq}$, is calculated as in formula (7):
$$Hstart_{up,iq} = Hstart_{iq} + \sum_{k=1}^{p-1} H_{uk,iq} \tag{7}$$
where $Hstart_{iq}$ is the starting height of the coordinate axis segment corresponding to data value $V_{iq}$, obtained in step 2.2, and $H_{uk,iq}$ is the height of the coordinate axis sub-segment for the records satisfying $V_u = V_{uk}$ and $V_i = V_{iq}$, obtained in step 5.4.
Step 5.6: Set the "next mapping height" of each coordinate axis sub-segment on the current coordinate axis.
Taking $V_u = V_{up}$ and $V_i = V_{iq}$ as an example, the next mapping height $Hnext_{up,iq}$ of the corresponding coordinate axis sub-segment is initialized to the starting height $Hstart_{up,iq}$ of that sub-segment.
Jump to step 7.
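Steps 5.2 to 5.6 can be sketched as follows. The sketch assumes that a sub-segment's height is its segment's height scaled by the share of joint records, that sub-segments are stacked inside their segment in order of the left axis's conversion value, and that the next mapping height starts at the sub-segment's own starting height. The function name `sub_segments` and all data are hypothetical.

```python
from collections import Counter

def sub_segments(left, cur, seg_height, seg_start):
    """Split each segment of the current axis by the left axis's data values.

    left, cur:  conversion values of every record on the left / current axis
    seg_height: q -> height of the current axis's segment for conversion value q
    seg_start:  q -> starting height of that segment
    Returns (p, q) -> {"height", "start", "next"}.
    """
    joint = Counter(zip(left, cur))      # step 5.2: joint record counts
    cur_count = Counter(cur)             # records per data value on the current axis
    n_left = max(left)                   # conversion values run from 1 to NV_u
    result = {}
    for q in seg_height:
        start = seg_start[q]
        for p in range(1, n_left + 1):   # step 5.3: NV_u sub-segments per segment
            h = seg_height[q] * joint[(p, q)] / cur_count[q]   # step 5.4
            # steps 5.5 and 5.6: starting height and initial next mapping height
            result[(p, q)] = {"height": h, "start": start, "next": start}
            start += h
    return result

# Four made-up records on two adjacent integer axes, axis height 100.
left = [1, 1, 2, 1]
cur = [1, 2, 1, 1]
subs = sub_segments(left, cur, seg_height={1: 75.0, 2: 25.0},
                    seg_start={1: 0.0, 2: 75.0})
```

A pair of values with no joint records simply yields a sub-segment of height 0.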
Step 6: This step calculates the data value mapping basic data for the case where the left adjacent coordinate axis corresponds to a non-integer data dimension, or the coordinate axis of the current data dimension is the leftmost coordinate axis.
Because there is no coordinate axis of an integer data dimension on the left, the segments of the current coordinate axis need not be divided further into coordinate axis sub-segments, and the next mapping height of every coordinate axis segment is set directly.
Taking $V_i = V_{iq}$ as an example, the next mapping height $Hnext_{iq}$ of the corresponding coordinate axis segment is assigned the value $Hstart_{iq}$, that is, the starting height of the coordinate axis segment corresponding to data value $V_{iq}$, obtained in step 2.2.
Step 7: For each record in the data set, calculate the mapping height of the data value of each dimension on the corresponding coordinate axis.
For each record, if the current data dimension is a non-integer data dimension, calculate the mapping height of the data value on the corresponding coordinate axis by the traditional method;
if the current data dimension is an integer data dimension and the coordinate axis adjacent on the left of its coordinate axis corresponds to an integer data dimension, continue with step 7.1;
if the current data dimension is an integer data dimension and its coordinate axis is the leftmost coordinate axis, or the left adjacent coordinate axis corresponds to a non-integer data dimension, continue with step 7.3.
Step 7.1: Taking the data value $V_i = V_{iq}$ of integer data dimension $D_i$ as an example, obtain the record's data value (named $V_{up}$) in the data dimension vector of the left coordinate axis (named $V_u$); that is, the record satisfies $V_u = V_{up}$ and $V_i = V_{iq}$.
According to $V_u = V_{up}$ and $V_i = V_{iq}$, obtain the next mapping height $Hnext_{up,iq}$ of the corresponding coordinate axis sub-segment from step 5. $Hnext_{up,iq}$ is the mapping height of this record on the coordinate axis corresponding to data dimension $D_i$.
Step 7.2: According to $V_i = V_{iq}$, obtain from step 3 the offset height $I_{iq}$ of one record in the coordinate axis segment corresponding to data value $V_{iq}$, and update $Hnext_{up,iq}$ as in formula (8):
$$Hnext_{up,iq} = Hnext_{up,iq} + I_{iq} \tag{8}$$
Jump to step 8.
Step 7.3: Taking the data value $V_i = V_{iq}$ of integer data dimension $D_i$ as an example, obtain from step 6 the next mapping height $Hnext_{iq}$ of the coordinate axis segment corresponding to $V_{iq}$. $Hnext_{iq}$ is the mapping height of this record on the coordinate axis corresponding to data dimension $D_i$.
Step 7.4: According to $V_i = V_{iq}$, obtain from step 3 the offset height $I_{iq}$ of one record in the coordinate axis segment corresponding to data value $V_{iq}$, and update $Hnext_{iq}$ as in formula (9):
$$Hnext_{iq} = Hnext_{iq} + I_{iq} \tag{9}$$
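The offset mapping of step 7 can be sketched for both cases as below. The sketch assumes that the update for sub-segments mirrors formula (9), that is, the next mapping height grows by one record offset each time a record is placed. The function names and all input values are hypothetical.

```python
def map_leftmost_axis(cur, seg_start, offset):
    """Steps 7.3 and 7.4: no integer axis on the left.

    cur:       conversion value of every record, in data set order
    seg_start: q -> starting height of the segment for conversion value q
    offset:    q -> offset height of one record in that segment
    """
    nxt = dict(seg_start)             # step 6: next mapping height = segment start
    heights = []
    for q in cur:
        heights.append(nxt[q])        # current next mapping height is the mapping height
        nxt[q] += offset[q]           # formula (9)
    return heights

def map_with_left_neighbor(left, cur, sub_start, offset):
    """Steps 7.1 and 7.2: the left adjacent axis is an integer axis.

    sub_start: (p, q) -> starting height of the coordinate axis sub-segment
    """
    nxt = dict(sub_start)             # step 5.6: next mapping height = sub-segment start
    heights = []
    for p, q in zip(left, cur):
        heights.append(nxt[(p, q)])
        nxt[(p, q)] += offset[q]      # assumed to mirror formula (9)
    return heights

# Four made-up records, axis height 100, so every record offset is 25.
left, cur = [1, 1, 2, 1], [1, 2, 1, 1]
offset = {1: 25.0, 2: 25.0}
plain = map_leftmost_axis(cur, {1: 0.0, 2: 75.0}, offset)
grouped = map_with_left_neighbor(
    left, cur, {(1, 1): 0.0, (2, 1): 50.0, (1, 2): 75.0, (2, 2): 100.0}, offset)
```

In `grouped`, the two records sharing the same left-axis value land next to each other inside the segment, which is what keeps their connecting lines from crossing.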
Step 8: To distinguish the coordinate axis segments within a coordinate axis corresponding to an integer data dimension, a different texture can be set for each coordinate axis segment. The textures can be distinctive colors or shading patterns.
Step 9: Draw the improved parallel coordinate visualization of the current data set from the coordinate axis information obtained in steps 1 to 8, the mapping heights of all records, and the textures of the coordinate axis segments.
Advantageous effects
With the parallel coordinate improvement method provided by the invention, the share of records held by each data value in each data dimension can be read directly from the relative heights of the segments of the coordinate axis; during data screening interaction, the association relationships among the data dimensions and their strengths can be obtained quickly; and the visual analysis capability for multidimensional integer-valued data sets is improved.
Detailed Description
The invention is further described below with reference to the accompanying drawings and examples.
Take the pesticide residue detection result data set as an example. Its data dimensions comprise (region, year, month, agricultural product, pesticide), and it contains 1241 data records, the first 10 of which are shown in Table 1.
Table 1 Example data of the pesticide residue detection result data set
The original data is converted into integer data, as shown in Table 2.
Table 2 Example data of the pesticide residue detection result data set after conversion into an integer-valued data set
The implementation flow chart of the parallel coordinate improvement method for multidimensional integer-valued data sets in this embodiment is shown in Fig. 1. The specific operation of the method is described below using the pesticide residue detection result data set:
Step 1: Count the number of distinct data values in each integer data dimension of the data set, and calculate the record ratio of each data value.
For the integer data dimension "month" (denoted $D_3$), the calculation proceeds as follows:
Step 1.1: Extract the data of dimension $D_3$ as a vector (denoted $V_3$), $V_3 = (11, 3, 1, \ldots, 2, 1)$. The number of data records in the data set is $T = 1241$, so $V_3$ has $T = 1241$ components.
Step 1.2: The number of distinct data values in the vector $V_3 = (11, 3, 1, \ldots, 2, 1)$ of data dimension $D_3$ is $NV_3 = 9$.
Step 1.3: Count the number of records for each data value in $V_3$ and sort the data values in descending order of record count. Following this order, convert the data values in $V_3$ to the integers 1 to $NV_3$ ($NV_3$ is obtained in step 1.2). The data value with conversion value 1 is February (named $V_{3,1}$), and the number of records in data dimension $D_3$ satisfying $V_3 = V_{3,1}$ is $NV_{3,1} = 283$.
In the invention, the value to which each data value in $V_3$ is converted is called its "conversion value"; conversion values range from 1 to $NV_3$ (obtained in step 1.2).
Step 1.4: Calculate the record ratio of each data value in the vector $V_3 = (11, 3, 1, \ldots, 2, 1)$. The record ratio $R_{3,1}$ of the records satisfying $V_3 = V_{3,1}$ is calculated according to formula (1), as in formula (10):
$$R_{3,1} = \frac{NV_{3,1}}{T} = \frac{283}{1241} \approx 0.228 \tag{10}$$
where $T = 1241$ is the total number of records in the data set described in step 1.1.
Step 2: Establish the coordinate axes according to the data distribution of all integer data dimensions in the data set. If the data set also contains non-integer data dimensions, the coordinate axes corresponding to those dimensions are established in the traditional way, unchanged.
For integer data dimension $D_3$, the corresponding coordinate axis is established as follows:
Divide the coordinate axis into $NV_3$ segments, called "coordinate axis segments", each corresponding to one data value of data dimension $D_3$. The height of each coordinate axis segment is proportional to the record ratio of the corresponding data value. A coordinate axis established in this way for an integer data dimension is composed of the coordinate axis segments of its different data values, and is called a "segmented coordinate axis" in the invention.
The information for each segment of the coordinate axis is calculated as follows:
According to step 2.1: Calculate the height of the coordinate axis segment corresponding to each data value from the height of the coordinate axis in the parallel coordinate system of the final visualization result (here $height = 520$).
Taking integer data dimension $D_3$ as an example, the height of the coordinate axis segment corresponding to data value $V_{3,1}$ is calculated according to formula (2), as in formula (11):
$$H_{3,1} = height \times R_{3,1} = 520 \times \frac{283}{1241} \approx 118.58 \tag{11}$$
where $R_{3,1}$ is the record ratio of the records satisfying $V_3 = V_{3,1}$, obtained in step 1.4.
Step 2.2: Calculate the starting height and ending height of each coordinate axis segment. Taking the coordinate axis segment corresponding to data value $V_{3,1}$ as an example, the starting height is calculated according to formula (3), as in formula (12):
$$Hstart_{3,1} = 0 \quad (j = 1) \tag{12}$$
and the ending height is calculated according to formula (4), as in formula (13):
$$Hend_{3,1} = Hstart_{3,1} + H_{3,1} \approx 0 + 118.58 = 118.58 \tag{13}$$
According to step 3: For each coordinate axis segment of the coordinate axes corresponding to all integer data dimensions, calculate the offset height of one data record. Taking the coordinate axis segment corresponding to data value $V_{3,1}$ as an example, the offset height of one record is calculated according to formula (5), as in formula (14):
$$I_{3,1} = \frac{H_{3,1}}{NV_{3,1}} \approx \frac{118.58}{283} \approx 0.419 \tag{14}$$
where $H_{3,1}$ is the height of the coordinate axis segment corresponding to data value $V_{3,1}$, obtained in step 2.1, and $NV_{3,1}$ is the number of records satisfying $V_3 = V_{3,1}$, obtained in step 1.3.
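The arithmetic for the "month" example can be checked with a few lines of Python. The inputs (1241 records, 283 records for February, axis height 520) come from the text; the formulas assume that the ratio is the record count over the total, the segment height is the axis height times the ratio, and the offset is the segment height divided by the record count. Variable names are illustrative.

```python
T = 1241            # total number of records in the data set
NV_3_1 = 283        # records whose month has conversion value 1 (February)
height = 520        # height of the coordinate axis

R_3_1 = NV_3_1 / T                 # record ratio
H_3_1 = height * R_3_1             # height of the coordinate axis segment
Hstart_3_1 = 0.0                   # first segment starts at the bottom of the axis
Hend_3_1 = Hstart_3_1 + H_3_1      # ending height of the segment
I_3_1 = H_3_1 / NV_3_1             # offset height of one record

print(round(R_3_1, 3), round(H_3_1, 2), round(Hend_3_1, 2), round(I_3_1, 3))
```

The offset height equals `height / T`, which is why it does not depend on which data value the segment belongs to.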
Step 4: Calculate the data value mapping basic data of all coordinate axes corresponding to integer data dimensions, according to the adjacency of the coordinate axes established in step 3.
The invention maps the same data value in different records of an integer data dimension to different heights, which solves the problem of same-point mapping and effectively reduces crossings of the connecting lines. In the invention this mapping method is named the "offset mapping" method.
In the offset mapping method, the mapping height of a data value depends on two factors. The first is the position of the record in the data set. The second is the mapping height of the record's data value on the left adjacent coordinate axis (this factor is not considered when the coordinate axis of the current data dimension is the leftmost coordinate axis).
The specific mapping method distinguishes two cases. If the data dimension corresponding to the left adjacent coordinate axis is an integer data dimension, continue with step 5. If the data dimension corresponding to the left adjacent coordinate axis is a non-integer data dimension, or the coordinate axis of the current data dimension is the leftmost coordinate axis, jump to step 6.
Step 5: This step calculates the data value mapping basic data for the case where the data dimension corresponding to the left adjacent coordinate axis is an integer data dimension.
For the current coordinate axis, the mapping height of each record on the integer data dimension is calculated as follows:
Step 5.1: Let $D_2$ be the integer data dimension corresponding to the coordinate axis adjacent on the left of the coordinate axis of the current integer data dimension, let $V_2$ be the vector extracted from $D_2$, and let $NV_2 = 3$ be the number of distinct data values in $V_2$ ($NV_2$ is calculated according to step 1.2).
Step 5.2: Count the joint record numbers of the integer data dimensions $D_2$ and $D_3$. For example, for conversion value $p = 1$ of $V_2$ and conversion value $q = 1$ of $V_3$, count the number of records satisfying $V_2 = V_{2,1}$ and $V_3 = V_{3,1}$ (named $NV_{2,1;3,1}$), where $V_{2,1}$ is the data value of $V_2$ with conversion value $p = 1$ and $V_{3,1}$ is the data value of $V_3$ with conversion value $q = 1$.
Step 5.3: According to the number $NV_2 = 3$ of distinct data values in $V_2$, divide the coordinate axis segment corresponding to data value $V_{3,1}$ on the current coordinate axis into $NV_2 = 3$ "coordinate axis sub-segments".
Step 5.4: Calculate the heights of all coordinate axis sub-segments on the current coordinate axis.
Taking $V_2 = V_{2,1}$ and $V_3 = V_{3,1}$ as an example, the height $H_{2,1;3,1}$ of the corresponding coordinate axis sub-segment is calculated according to formula (6), as in formula (15):
$$H_{2,1;3,1} = H_{3,1} \times \frac{NV_{2,1;3,1}}{NV_{3,1}} \tag{15}$$
where $NV_{3,1}$ is the number of records satisfying $V_3 = V_{3,1}$, obtained in step 1.3; $H_{3,1}$ is the height of the coordinate axis segment corresponding to $V_3 = V_{3,1}$, obtained in step 2.1; and $NV_{2,1;3,1}$ is the number of records satisfying $V_2 = V_{2,1}$ and $V_3 = V_{3,1}$, obtained in step 5.2.
Step 5.5: Calculate the starting heights of all coordinate axis sub-segments on the current coordinate axis.
Taking $V_2 = V_{2,1}$ and $V_3 = V_{3,1}$ as an example, the starting height $Hstart_{2,1;3,1}$ of the corresponding coordinate axis sub-segment is calculated according to formula (7), as in formula (16):
$$Hstart_{2,1;3,1} = Hstart_{3,1} = 0 \quad (p = 1) \tag{16}$$
where $Hstart_{3,1}$ is the starting height of the coordinate axis segment corresponding to data value $V_{3,1}$, obtained in step 2.2.
Step 5.6: Set the "next mapping height" of each coordinate axis sub-segment on the current coordinate axis.
Taking $V_2 = V_{2,1}$ and $V_3 = V_{3,1}$ as an example, the next mapping height $Hnext_{2,1;3,1}$ of the corresponding coordinate axis sub-segment is initialized to the starting height $Hstart_{2,1;3,1}$ of that sub-segment.
Jump to step 7.
Step 6: This step calculates the data value mapping basic data for the case where the left adjacent coordinate axis corresponds to a non-integer data dimension, or the coordinate axis of the current data dimension is the leftmost coordinate axis.
Because there is no coordinate axis of an integer data dimension on the left, the segments of the current coordinate axis need not be divided further into coordinate axis sub-segments, and the next mapping height of every coordinate axis segment is set directly.
Taking $V_1 = V_{1,1}$ as an example, the next mapping height $Hnext_{1,1}$ of the corresponding coordinate axis segment is assigned the value $Hstart_{1,1} = 0$, that is, the starting height of the coordinate axis segment corresponding to data value $V_{1,1}$, obtained in step 2.2.
Step 7: For each record in the data set, calculate the mapping height of the data value of each dimension on the corresponding coordinate axis.
For each record, if the current data dimension is a non-integer data dimension, calculate the mapping height of the data value on the corresponding coordinate axis by the traditional method;
if the current data dimension is an integer data dimension and the coordinate axis adjacent on the left of its coordinate axis corresponds to an integer data dimension, continue with step 7.1;
if the current data dimension is an integer data dimension and its coordinate axis is the leftmost coordinate axis, or the left adjacent coordinate axis corresponds to a non-integer data dimension, continue with step 7.3.
Step 7.1: Taking the data value $V_3 = V_{3,1}$ of integer data dimension $D_3$ as an example, obtain the record's data value (named $V_{2,1}$) in the data dimension vector of the left coordinate axis (named $V_2$); that is, the record satisfies $V_2 = V_{2,1}$ and $V_3 = V_{3,1}$.
According to $V_2 = V_{2,1}$ and $V_3 = V_{3,1}$, obtain the next mapping height $Hnext_{2,1;3,1}$ of the corresponding coordinate axis sub-segment from step 5. $Hnext_{2,1;3,1}$ is the mapping height of this record on the coordinate axis corresponding to data dimension $D_3$.
Step 7.2: According to $V_3 = V_{3,1}$, obtain from step 3 the offset height $I_{3,1}$ of one record in the coordinate axis segment corresponding to data value $V_{3,1}$, and update $Hnext_{2,1;3,1}$ according to formula (8), as in formula (17):
$$Hnext_{2,1;3,1} = Hnext_{2,1;3,1} + I_{3,1} \tag{17}$$
Jump to step 8.
Step 7.3: Taking the data value $V_3 = V_{3,1}$ of integer data dimension $D_3$ as an example, obtain from step 6 the next mapping height $Hnext_{3,1}$ of the coordinate axis segment corresponding to $V_{3,1}$. $Hnext_{3,1}$ is the mapping height of this record on the coordinate axis corresponding to data dimension $D_3$.
Step 7.4: According to $V_3 = V_{3,1}$, obtain from step 3 the offset height $I_{3,1}$ of one record in the coordinate axis segment corresponding to data value $V_{3,1}$, and update $Hnext_{3,1}$ according to formula (9).
Step 8: To distinguish the coordinate axis segments within a coordinate axis corresponding to an integer data dimension, a different texture can be set for each coordinate axis segment. The textures can be distinctive colors or shading patterns.
In this embodiment, diagonal stripes and cross stripes are selected as the coordinate axis segment textures.
Step 9: Draw the improved parallel coordinate visualization of the current data set from the coordinate axis information obtained in steps 1 to 8, the mapping heights of all records, and the textures of the coordinate axis segments.
Fig. 2 shows the visualization obtained by applying the parallel coordinate improvement method for multidimensional integer-valued data sets to the pesticide residue detection result data set (desensitized and declassified). From this visualization, the following conclusions can be drawn by multi-dimensional comparison of the example data:
(1) When the method is applied in drawing the parallel coordinates, each coordinate axis is divided into several segments, and the height of each segment represents the number of data records with that segment's data value, so record counts can be compared directly. In the "region" dimension, the sunny region has the most data records and the mountain region the fewest. In the "year" dimension, 2012 has the most data records, followed by 2014 and finally 2013. The number of data records for February is the largest and the number of data records for February is the smallest in the "month" dimension. In the "day" dimension, the fifth has the most records and the eighteenth the fewest. In the "agricultural product" dimension, cucumber has the most records and peach the fewest. In the "pesticide" dimension, "not detected" is the most frequent value, indicating that pesticide use on most agricultural products meets the standard.
(2) The association between any two coordinate axes can be analyzed by adjusting the order of the coordinate axes. For example, the "year" dimension can be analyzed for its association with the "region" dimension, the "month" dimension, or the "day" dimension.
In the invention, screening on the data values of one dimension makes it possible to analyze how the screened records are distributed over the data values of the other dimensions. Fig. 3 shows the visualization obtained by screening for the sunny region on the basis of Fig. 2: only data records whose "region" value is the sunny region are displayed, and the rest are hidden. From the screened visualization, the following conclusions can be drawn: in the "year" dimension, 2012 has the most data records, while 2014 and 2013 have roughly the same number; in the "month" dimension, October has the most data records and September the fewest; in the "day" dimension, the 10th has the most data records and the 14th the fewest.
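The screening interaction can be sketched as a filter followed by a recount. This is a minimal sketch under the assumption that records are stored as dictionaries keyed by dimension name; the function names and the sample records are hypothetical and are not taken from the pesticide residue data set.

```python
from collections import Counter

def screen(records, dimension, value):
    """Keep only the records whose data value in `dimension` equals `value`."""
    return [r for r in records if r[dimension] == value]

def distribution(records, dimension):
    """Record count of each data value in `dimension`, most frequent first."""
    return Counter(r[dimension] for r in records).most_common()

# Hypothetical records with integer-coded regions.
records = [
    {"region": 1, "year": 2012},
    {"region": 1, "year": 2012},
    {"region": 1, "year": 2014},
    {"region": 2, "year": 2013},
]
selected = screen(records, "region", 1)
year_distribution = distribution(selected, "year")
```

After screening, the segment heights and mapping heights would be recomputed from `selected` alone, so the redrawn axes show the distribution of only the chosen records.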